Chapter 1

Discrete Distributions 

1.1 DISCRETE DISTRIBUTION 1 

1.1.1 DISCRETE DISTRIBUTION 

1.1.1.1 RANDOM VARIABLE OF DISCRETE TYPE 

A SAMPLE SPACE S may be difficult to describe if the elements of S are not numbers. Let us discuss
how one can use a rule by which each simple outcome of a random experiment, an element s of S, may be
associated with a real number x.

Definition 1.1: DEFINITION OF RANDOM VARIABLE 

1. Given a random experiment with a sample space S, a function X that assigns to each element
s in S one and only one real number X(s) = x is called a random variable. The space of X is
the set of real numbers $\{x : x = X(s),\ s \in S\}$, where $s \in S$ means that the element s belongs
to the set S.

2. It may be that the set S has elements that are themselves real numbers. In such an instance we 
could write X (s) = s so that X is the identity function and the space of X is also S. This is 
illustrated in the example below. 

Example 1.1 

Let the random experiment be the cast of a die, observing the number of spots on the side facing
up. The sample space associated with this experiment is S = {1,2,3,4,5,6}. For each $s \in S$,
let X(s) = s. The space of the random variable X is then {1,2,3,4,5,6}.

If we associate a probability of 1/6 with each outcome, then, for example, P(X = 5) = 1/6,
$P(2 \le X \le 5) = 4/6$, and $P(X \le 2) = 2/6$ seem to be reasonable assignments, where $(2 \le X \le 5)$
means (X = 2, 3, 4 or 5) and $(X \le 2)$ means (X = 1 or 2), in this example.

We can recognize two major difficulties: 

1. In many practical situations the probabilities assigned to the event are unknown. 

2. Since there are many ways of defining a function X on S, which function do we want to use? 

1.1.1.1.1 

Let X denote a random variable with one-dimensional space R, a subset of the real numbers. Suppose that
the space R contains a countable number of points; that is, R contains either a finite number of points or
the points of R can be put into a one-to-one correspondence with the positive integers. Such a set R is called
a set of discrete points or simply a discrete sample space.

Furthermore, the random variable X is called a random variable of the discrete type, and X is
said to have a distribution of the discrete type. For a random variable X of the discrete type, the
probability P(X = x) is frequently denoted by f(x) and is called the probability density function,
abbreviated p.d.f.

Let f(x) be the p.d.f. of the random variable X of the discrete type, and let R be the space of X.
Since f(x) = P(X = x), $x \in R$, f(x) must be positive for $x \in R$, and we want all these
probabilities to add to 1 because each P(X = x) represents the fraction of times x can be expected to occur.
Moreover, to determine the probability associated with the event $A \subset R$, one would sum the probabilities
of the x values in A.

That is, we want f(x) to satisfy the properties

1. $f(x) > 0$, $x \in R$;

2. $\sum_{x \in R} f(x) = 1$;

3. $P(X \in A) = \sum_{x \in A} f(x)$, where $A \subset R$.

Usually we let f(x) = 0 when $x \notin R$, and thus the domain of f(x) is the set of real numbers. When we define
the p.d.f. f(x) and do not say zero elsewhere, we tacitly mean that f(x) has been defined at all x's
in the space R, and it is assumed that f(x) = 0 elsewhere, namely, f(x) = 0, $x \notin R$. Since the probability
P(X = x) = f(x) > 0 when $x \in R$ and since R contains all the probabilities associated with X, R is
sometimes referred to as the support of X as well as the space of X.

Example 1.2 

Roll a four-sided die twice and let X equal the larger of the two outcomes if they are different
and the common value if they are the same. The sample space for this experiment is
$S = \{(d_1, d_2) : d_1 = 1,2,3,4;\ d_2 = 1,2,3,4\}$, where each of these 16 points has probability 1/16.
Then P(X = 1) = P[(1,1)] = 1/16, P(X = 2) = P[(1,2), (2,1), (2,2)] = 3/16, and similarly
P(X = 3) = 5/16 and P(X = 4) = 7/16. That is, the p.d.f. of X can be written simply as

$$f(x) = P(X = x) = \frac{2x - 1}{16}, \quad x = 1,2,3,4.$$

We could add that f(x) = 0 elsewhere; but if we do not, one should take f(x) to equal zero
when $x \notin R$.
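The p.d.f. in this example can be checked by brute-force enumeration of the 16 equally likely outcomes. The following is a minimal sketch; the function name `pmf_of_max` is an illustrative choice, not part of the text.

```python
from fractions import Fraction
from itertools import product

def pmf_of_max(sides=4):
    """Return {x: P(X = x)}, where X is the larger of two fair die rolls."""
    outcomes = list(product(range(1, sides + 1), repeat=2))
    weight = Fraction(1, len(outcomes))  # each ordered pair is equally likely
    pmf = {}
    for d1, d2 in outcomes:
        x = max(d1, d2)  # equals the common value when d1 == d2
        pmf[x] = pmf.get(x, Fraction(0)) + weight
    return pmf

pmf = pmf_of_max()
for x in sorted(pmf):
    # Second column: enumerated probability; third: the closed form (2x - 1)/16.
    print(x, pmf[x], Fraction(2 * x - 1, 16))
```

Exact fractions are used so that the enumerated probabilities can be compared with the closed form without rounding.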



1.1.1.1.2 

A better understanding of a particular probability distribution can often be obtained with a graph that 
depicts the p.d.f. of X. 

note: the graph of the p.d.f., when f(x) > 0, would be simply the set of points $\{[x, f(x)] : x \in R\}$,
where R is the space of X.

Two types of graphs can be used to give a better visual appreciation of the p.d.f., namely, a bar graph and
a probability histogram. A bar graph of the p.d.f. f(x) of the random variable X is a graph having a
vertical line segment drawn from (x, 0) to [x, f(x)] at each x in R, the space of X. If X can only assume
integer values, a probability histogram of the p.d.f. f(x) is a graphical representation that has a rectangle
of height f(x) and a base of length 1, centered at x, for each $x \in R$, the space of X.

Definition 1.2: CUMULATIVE DISTRIBUTION FUNCTION 

1. Let X be a random variable of the discrete type with space R and p.d.f. f(x) = P(X = x),
$x \in R$. Now take x to be a real number and consider the set A of all points in R that are less than
or equal to x. That is, $A = \{t : t \le x \text{ and } t \in R\}$.

2. Let us define the function F(x) by

$$F(x) = P(X \le x) = \sum_{t \in A} f(t). \tag{1.1}$$

The function F(x) is called the distribution function (sometimes cumulative distribution
function) of the discrete-type random variable X.

Several properties of a distribution function F(x) can be listed as a consequence of the fact that probability
must be a value between 0 and 1, inclusive:

• $0 \le F(x) \le 1$ because F(x) is a probability,

• F(x) is a nondecreasing function of x,

• F(y) = 1, where y is any value greater than or equal to the largest value in R; and F(z) = 0, where
z is any value less than the smallest value in R;

• If X is a random variable of the discrete type, then F(x) is a step function, and the height at a step
at x, $x \in R$, equals the probability P(X = x).

note: It is clear that the probability distribution associated with the random variable X can be 
described by either the distribution function F(x) or by the probability density function f(x). The 
function used is a matter of convenience; in most instances, f(x) is easier to use than F(x). 
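To make the step-function behaviour concrete, here is a small sketch that builds F(x) from a p.d.f. It assumes the p.d.f. of Example 1.2, with probabilities 1/16, 3/16, 5/16, 7/16 on x = 1,2,3,4; the names `f` and `F` simply mirror the notation of the text.

```python
from fractions import Fraction

# p.d.f. of Example 1.2: f(x) = (2x - 1)/16 on the space R = {1, 2, 3, 4}
f = {x: Fraction(2 * x - 1, 16) for x in range(1, 5)}

def F(x):
    """Distribution function: sum of f(t) over all support points t <= x."""
    return sum((p for t, p in f.items() if t <= x), Fraction(0))

# F is defined for every real x, is constant between support points,
# and jumps by f(x) at each x in R.
for x in (0.5, 1, 2.7, 3, 4, 10):
    print(x, F(x))
```

The jump F(3) - F(2) recovers f(3), which is the sense in which either function describes the distribution.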
Graphical representation of the relationship between p.d.f. and c.d.f.

[Figure: upper panel, f(x) plotted against the random variable X; lower panel, F(x) rising to 1, with F(a) = P(X <= a) marked.]

Figure 1.1: The area under the p.d.f. curve up to a equals the value of the c.d.f. curve at the point a.



1.1.1.1.3 



Definition 1.3: MATHEMATICAL EXPECTATION 

If f(x) is the p.d.f. of the random variable X of the discrete type with space R and if the summation

$$\sum_{R} u(x) f(x) = \sum_{x \in R} u(x) f(x) \tag{1.2}$$

exists, then the sum is called the mathematical expectation or the expected value of the
function u(X), and it is denoted by E[u(X)]. That is,

$$E[u(X)] = \sum_{R} u(x) f(x). \tag{1.3}$$

We can think of the expected value E[u(X)] as a weighted mean of u(x), $x \in R$, where the
weights are the probabilities f(x) = P(X = x).

note: The usual definition of the mathematical expectation of u(X) requires that the sum converges
absolutely; that is, $\sum_{x \in R} |u(x)| f(x)$ exists.

There is another important observation that must be made about consistency of this definition. Certainly,
this function u(X) of the random variable X is itself a random variable, say Y. Suppose that we find the
p.d.f. of Y to be g(y) on the support $R_1$. Then E(Y) is given by the summation $\sum_{y \in R_1} y\, g(y)$.
In general it is true that

$$\sum_{R} u(x) f(x) = \sum_{y \in R_1} y\, g(y);$$

that is, the same expectation is obtained by either method.

Example 1.3 

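Let X be the random variable defined by the outcome of the cast of the die. Thus the p.d.f. of X
is

$$f(x) = \frac{1}{6}, \quad x = 1,2,3,4,5,6.$$

In terms of the observed value x, the function u is as follows:

$$u(x) = \begin{cases} 1, & x = 1,2,3, \\ 5, & x = 4,5, \\ 35, & x = 6. \end{cases}$$

The mathematical expectation is equal to

$$\sum_{x=1}^{6} u(x) f(x) = 1\left(\frac{3}{6}\right) + 5\left(\frac{2}{6}\right) + 35\left(\frac{1}{6}\right) = 8. \tag{1.4}$$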

Example 1.4 

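Let the random variable X have the p.d.f. $f(x) = \frac{1}{3}$, $x \in R$, where R = {-1, 0, 1}. Let $u(X) = X^2$.
Then

$$\sum_{x \in R} x^2 f(x) = (-1)^2\left(\frac{1}{3}\right) + (0)^2\left(\frac{1}{3}\right) + (1)^2\left(\frac{1}{3}\right) = \frac{2}{3}. \tag{1.5}$$

However, the support of the random variable $Y = X^2$ is $R_1 = \{0, 1\}$ and

$$P(Y = 0) = P(X = 0) = \frac{1}{3},$$

$$P(Y = 1) = P(X = -1) + P(X = 1) = \frac{1}{3} + \frac{1}{3} = \frac{2}{3}.$$

That is,

$$g(y) = \begin{cases} \frac{1}{3}, & y = 0, \\ \frac{2}{3}, & y = 1, \end{cases}$$

and $R_1 = \{0, 1\}$. Hence

$$\sum_{y \in R_1} y\, g(y) = 0\left(\frac{1}{3}\right) + 1\left(\frac{2}{3}\right) = \frac{2}{3},$$

which illustrates the preceding observation.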
Theorem 1.1: 
When it exists, mathematical expectation E satisfies the following properties: 

1. If c is a constant, E(c) = c,

2. If c is a constant and u is a function, E[cu(X)] = cE[u(X)],

3. If $c_1$ and $c_2$ are constants and $u_1$ and $u_2$ are functions, then
$E[c_1 u_1(X) + c_2 u_2(X)] = c_1 E[u_1(X)] + c_2 E[u_2(X)]$.

Proof:

First, we have for the proof of (1) that

$$E(c) = \sum_{R} c f(x) = c \sum_{R} f(x) = c,$$

because $\sum_{R} f(x) = 1$.

Next, to prove (2), we see that

$$E[cu(X)] = \sum_{R} c\,u(x) f(x) = c \sum_{R} u(x) f(x) = cE[u(X)].$$

Finally, the proof of (3) is given by

$$E[c_1 u_1(X) + c_2 u_2(X)] = \sum_{R} [c_1 u_1(x) + c_2 u_2(x)] f(x) = \sum_{R} c_1 u_1(x) f(x) + \sum_{R} c_2 u_2(x) f(x).$$

By applying (2), we obtain

$$E[c_1 u_1(X) + c_2 u_2(X)] = c_1 E[u_1(X)] + c_2 E[u_2(X)].$$

Property (3) can be extended to more than two terms by mathematical induction; that is, we
have

$$(3')\quad E\left[\sum_{i=1}^{k} c_i u_i(X)\right] = \sum_{i=1}^{k} c_i E[u_i(X)].$$

Because of property (3'), mathematical expectation E is called a linear or distributive operator.

Example 1.5 

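Let X have the p.d.f. $f(x) = \frac{x}{10}$, x = 1,2,3,4.
Then

$$E(X) = \sum_{x=1}^{4} x\left(\frac{x}{10}\right) = 1\left(\frac{1}{10}\right) + 2\left(\frac{2}{10}\right) + 3\left(\frac{3}{10}\right) + 4\left(\frac{4}{10}\right) = 3,$$

$$E(X^2) = \sum_{x=1}^{4} x^2\left(\frac{x}{10}\right) = 1^2\left(\frac{1}{10}\right) + 2^2\left(\frac{2}{10}\right) + 3^2\left(\frac{3}{10}\right) + 4^2\left(\frac{4}{10}\right) = 10,$$

and

$$E[X(5 - X)] = 5E(X) - E(X^2) = (5)(3) - 10 = 5.$$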



1.1.2 



note: the MEAN, VARIANCE, and STANDARD DEVIATION (Section 1.3.1: The MEAN, 
VARIANCE, and STANDARD DEVIATION) 
1.2.1 MATHEMATICAL EXPECTATION

Definition 1.4: MATHEMATICAL EXPECTATION

If f(x) is the p.d.f. of the random variable X of the discrete type with space R and if the
summation

$$\sum_{R} u(x) f(x) = \sum_{x \in R} u(x) f(x) \tag{1.6}$$

exists, then the sum is called the mathematical expectation or the expected value of the function
u(X), and it is denoted by E[u(X)]. That is,

$$E[u(X)] = \sum_{R} u(x) f(x). \tag{1.7}$$

We can think of the expected value E[u(X)] as a weighted mean of u(x), $x \in R$, where the weights are the
probabilities f(x) = P(X = x).

note: The usual definition of the mathematical expectation of u(X) requires that the sum
converges absolutely; that is, $\sum_{x \in R} |u(x)| f(x)$ exists.

There is another important observation that must be made about consistency of this definition. Certainly,
this function u(X) of the random variable X is itself a random variable, say Y. Suppose that we find the
p.d.f. of Y to be g(y) on the support $R_1$. Then, E(Y) is given by the summation $\sum_{y \in R_1} y\, g(y)$.

In general it is true that $\sum_{R} u(x) f(x) = \sum_{y \in R_1} y\, g(y)$.

That is, the same expectation is obtained by either method.

1.2.1.1 

Example 1.6 

Let X be the random variable defined by the outcome of the cast of the die. Thus the p.d.f. of X 
is 

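$$f(x) = \frac{1}{6}, \quad x = 1,2,3,4,5,6.$$

In terms of the observed value x, the function u is as follows:

$$u(x) = \begin{cases} 1, & x = 1,2,3, \\ 5, & x = 4,5, \\ 35, & x = 6. \end{cases}$$

The mathematical expectation is equal to

$$\sum_{x=1}^{6} u(x) f(x) = 1\left(\frac{1}{6}\right) + 1\left(\frac{1}{6}\right) + 1\left(\frac{1}{6}\right) + 5\left(\frac{1}{6}\right) + 5\left(\frac{1}{6}\right) + 35\left(\frac{1}{6}\right) = 1\left(\frac{3}{6}\right) + 5\left(\frac{2}{6}\right) + 35\left(\frac{1}{6}\right) = 8.$$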



1.2.1.2 

Example 1.7 

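Let the random variable X have the p.d.f.

$$f(x) = \frac{1}{3}, \quad x \in R,$$

where R = {-1, 0, 1}. Let $u(X) = X^2$. Then

$$\sum_{x \in R} x^2 f(x) = (-1)^2\left(\frac{1}{3}\right) + (0)^2\left(\frac{1}{3}\right) + (1)^2\left(\frac{1}{3}\right) = \frac{2}{3}.$$

However, the support of the random variable $Y = X^2$ is $R_1 = \{0, 1\}$ and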



2 This content is available online at <http://cnx.Org/content/ml3530/l.2/>. 
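$$P(Y = 0) = P(X = 0) = \frac{1}{3},$$

$$P(Y = 1) = P(X = -1) + P(X = 1) = \frac{1}{3} + \frac{1}{3} = \frac{2}{3}.$$

That is,

$$g(y) = \begin{cases} \frac{1}{3}, & y = 0, \\ \frac{2}{3}, & y = 1, \end{cases}$$

and $R_1 = \{0, 1\}$. Hence

$$\sum_{y \in R_1} y\, g(y) = 0\left(\frac{1}{3}\right) + 1\left(\frac{2}{3}\right) = \frac{2}{3},$$

which illustrates the preceding observation.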
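The consistency observation (summing u(x)f(x) over R gives the same value as summing y g(y) over the support of Y) can be verified mechanically. This is a minimal sketch for the example above; the variable names `e_direct` and `e_via_g` are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

f = {x: Fraction(1, 3) for x in (-1, 0, 1)}  # p.d.f. of X on R = {-1, 0, 1}

def u(x):
    return x ** 2  # Y = u(X) = X^2

# Method 1: sum u(x) f(x) over the space of X.
e_direct = sum(u(x) * p for x, p in f.items())

# Method 2: derive the p.d.f. g of Y, then sum y g(y) over its support.
g = defaultdict(Fraction)  # Fraction() is 0, so missing keys start at zero
for x, p in f.items():
    g[u(x)] += p
e_via_g = sum(y * p for y, p in g.items())

print(dict(g))            # g(0) = 1/3, g(1) = 2/3
print(e_direct, e_via_g)  # both equal 2/3
```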



1.2.1.3 



Theorem 1.2: 

When it exists, mathematical expectation E satisfies the following properties: 

1. If c is a constant, E(c) = c,

2. If c is a constant and u is a function, E[cu(X)] = cE[u(X)],

3. If $c_1$ and $c_2$ are constants and $u_1$ and $u_2$ are functions, then
$E[c_1 u_1(X) + c_2 u_2(X)] = c_1 E[u_1(X)] + c_2 E[u_2(X)]$.

Proof:

First, we have for the proof of (1) that

$$E(c) = \sum_{R} c f(x) = c \sum_{R} f(x) = c,$$

because $\sum_{R} f(x) = 1$.

Next, to prove (2), we see that

$$E[cu(X)] = \sum_{R} c\,u(x) f(x) = c \sum_{R} u(x) f(x) = cE[u(X)].$$

Finally, the proof of (3) is given by

$$E[c_1 u_1(X) + c_2 u_2(X)] = \sum_{R} [c_1 u_1(x) + c_2 u_2(x)] f(x) = \sum_{R} c_1 u_1(x) f(x) + \sum_{R} c_2 u_2(x) f(x).$$

By applying (2), we obtain

$$E[c_1 u_1(X) + c_2 u_2(X)] = c_1 E[u_1(X)] + c_2 E[u_2(X)].$$

Property (3) can be extended to more than two terms by mathematical induction; that is, we
have (3')

$$E\left[\sum_{i=1}^{k} c_i u_i(X)\right] = \sum_{i=1}^{k} c_i E[u_i(X)].$$

Because of property (3'), mathematical expectation E is called a linear or distributive operator.



1.2.1.4 



Example 1.8 

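Let X have the p.d.f. $f(x) = \frac{x}{10}$, x = 1,2,3,4. Then

$$E(X) = \sum_{x=1}^{4} x\left(\frac{x}{10}\right) = 1\left(\frac{1}{10}\right) + 2\left(\frac{2}{10}\right) + 3\left(\frac{3}{10}\right) + 4\left(\frac{4}{10}\right) = 3,$$

$$E(X^2) = \sum_{x=1}^{4} x^2\left(\frac{x}{10}\right) = 1^2\left(\frac{1}{10}\right) + 2^2\left(\frac{2}{10}\right) + 3^2\left(\frac{3}{10}\right) + 4^2\left(\frac{4}{10}\right) = 10,$$

and

$$E[X(5 - X)] = 5E(X) - E(X^2) = (5)(3) - 10 = 5.$$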
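The linearity used in the last step can be checked by computing E[X(5 - X)] directly from the definition and comparing it with 5E(X) - E(X^2). The sketch below assumes the p.d.f. f(x) = x/10 on x = 1,2,3,4 from this example; the helper `E` is an illustrative name.

```python
from fractions import Fraction

f = {x: Fraction(x, 10) for x in range(1, 5)}  # f(x) = x/10, x = 1, 2, 3, 4

def E(u):
    """Expected value of u(X): the f-weighted sum of u(x) over the space of X."""
    return sum(u(x) * p for x, p in f.items())

mean = E(lambda x: x)              # E(X)
second = E(lambda x: x ** 2)       # E(X^2)
direct = E(lambda x: x * (5 - x))  # E[X(5 - X)] straight from the definition

print(mean, second, direct, 5 * mean - second)
```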



1.3 THE MEAN, VARIANCE, AND STANDARD DEVIATION 3 
1.3.1 The MEAN, VARIANCE, and STANDARD DEVIATION 

1.3.1.1 MEAN and VARIANCE 

Certain mathematical expectations are so important that they have special names. In this section we consider 
two of them: the mean and the variance. 

1.3.1.1.1 

Mean Value 

If X is a random variable with p.d.f. f(x) of the discrete type and space $R = \{b_1, b_2, b_3, \ldots\}$, then

$$E(X) = \sum_{R} x f(x) = b_1 f(b_1) + b_2 f(b_2) + b_3 f(b_3) + \cdots$$

is the weighted average of the numbers belonging to R, where the weights are given by the p.d.f. f(x).

We call E(X) the mean of X (or the mean of the distribution) and denote it by $\mu$. That is,
$\mu = E(X)$.

note: In mechanics, the weighted average of the points $b_1, b_2, b_3, \ldots$ in one-dimensional space
is called the centroid of the system. Those without the mechanics background can think of the
centroid as being the point of balance for the system in which the weights $f(b_1), f(b_2), f(b_3), \ldots$
are placed upon the points $b_1, b_2, b_3, \ldots$.

Example 1.9 

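Let X have the p.d.f.

$$f(x) = \begin{cases} \frac{1}{8}, & x = 0, 3, \\ \frac{3}{8}, & x = 1, 2. \end{cases}$$

The mean of X is

$$\mu = E(X) = \sum_{x=0}^{3} x f(x) = 0\left(\frac{1}{8}\right) + 1\left(\frac{3}{8}\right) + 2\left(\frac{3}{8}\right) + 3\left(\frac{1}{8}\right) = \frac{3}{2}.$$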



The example below shows that if the outcomes of X are equally likely (i.e., each of the outcomes has the 
same probability), then the mean of X is the arithmetic average of these outcomes. 



3 This content is available online at <http://cnx.Org/content/ml3122/l.3/>. 
Example 1.10

Roll a fair die and let X denote the outcome. Thus X has the p.d.f.

$$f(x) = \frac{1}{6}, \quad x = 1,2,3,4,5,6.$$

Then,

$$E(X) = \sum_{x=1}^{6} x\left(\frac{1}{6}\right) = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = \frac{7}{2},$$

which is the arithmetic average of the first six positive integers.



1.3.1.1.2 

Variance 

It was noted that the mean $\mu = E(X)$ is the centroid of a system of weights, or a measure of the central
location of the probability distribution of X. A measure of the dispersion or spread of a distribution
is defined as follows:

If $u(x) = (x - \mu)^2$ and $E[(X - \mu)^2]$ exists, the variance, frequently denoted by $\sigma^2$ or Var(X), of a
random variable X of the discrete type (or variance of the distribution) is defined by

$$\sigma^2 = E[(X - \mu)^2] = \sum_{R} (x - \mu)^2 f(x). \tag{1.8}$$

The positive square root of the variance is called the standard deviation of X and is denoted by

$$\sigma = \sqrt{\mathrm{Var}(X)} = \sqrt{E[(X - \mu)^2]}. \tag{1.9}$$



Example 1.11 

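Let the p.d.f. of X be defined by

$$f(x) = \frac{x}{6}, \quad x = 1,2,3.$$

The mean of X is

$$\mu = E(X) = 1\left(\frac{1}{6}\right) + 2\left(\frac{2}{6}\right) + 3\left(\frac{3}{6}\right) = \frac{7}{3}.$$

To find the variance and standard deviation of X we first find

$$E(X^2) = 1^2\left(\frac{1}{6}\right) + 2^2\left(\frac{2}{6}\right) + 3^2\left(\frac{3}{6}\right) = \frac{36}{6} = 6.$$

Thus the variance of X is

$$\sigma^2 = E(X^2) - \mu^2 = 6 - \left(\frac{7}{3}\right)^2 = \frac{5}{9},$$

and the standard deviation of X is

$$\sigma = \sqrt{5/9} = 0.745.$$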
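As a check on this example, the variance can be computed both from the definition (1.8) and from the shortcut $E(X^2) - \mu^2$. The following sketch assumes the p.d.f. f(x) = x/6 on x = 1,2,3; the variable names are illustrative.

```python
from fractions import Fraction
from math import sqrt

f = {x: Fraction(x, 6) for x in (1, 2, 3)}  # f(x) = x/6, x = 1, 2, 3

mu = sum(x * p for x, p in f.items())                       # mean E(X)
var_def = sum((x - mu) ** 2 * p for x, p in f.items())      # definition (1.8)
var_short = sum(x ** 2 * p for x, p in f.items()) - mu ** 2  # E(X^2) - mu^2
sigma = sqrt(var_def)                                        # standard deviation

print(mu, var_def, var_short, round(sigma, 3))
```

Both routes give 5/9, and the standard deviation rounds to 0.745.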

Example 1.12 

Let X be a random variable with mean $\mu_X$ and variance $\sigma_X^2$. Of course, Y = aX + b, where a and
b are constants, is a random variable, too. The mean of Y is

$$\mu_Y = E(Y) = E(aX + b) = aE(X) + b = a\mu_X + b.$$

Moreover, the variance of Y is

$$\sigma_Y^2 = E[(Y - \mu_Y)^2] = E[(aX + b - a\mu_X - b)^2] = E[a^2 (X - \mu_X)^2] = a^2 \sigma_X^2.$$



1.3.1.1.3 

Moments of the distribution 

Let r be a positive integer. If

$$E(X^r) = \sum_{R} x^r f(x)$$

exists, it is called the rth moment of the distribution about the origin. The expression moment has its
origin in the study of mechanics.
In addition, the expectation

$$E[(X - b)^r] = \sum_{R} (x - b)^r f(x)$$

is called the rth moment of the distribution about b. For a given positive integer r,

$$E[(X)_r] = E[X(X - 1)(X - 2) \cdots (X - r + 1)]$$

is called the rth factorial moment.

note: The second factorial moment is equal to the difference of the second and first moments:

$$E[X(X - 1)] = E(X^2) - E(X).$$

There is another formula that can be used for computing the variance that uses the second factorial moment
and sometimes simplifies the calculations.

First find the values of E(X) and E[X(X - 1)]. Then

$$\sigma^2 = E[X(X - 1)] + E(X) - [E(X)]^2,$$

since, using the distributive property of E, this becomes

$$\sigma^2 = E(X^2) - E(X) + E(X) - [E(X)]^2 = E(X^2) - \mu^2.$$



Example 1.13 

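Continuing with Example 1.11, where f(x) = x/6, x = 1,2,3, it can be found that

$$E[X(X - 1)] = 1(0)\left(\frac{1}{6}\right) + 2(1)\left(\frac{2}{6}\right) + 3(2)\left(\frac{3}{6}\right) = \frac{22}{6}.$$

Thus

$$\sigma^2 = E[X(X - 1)] + E(X) - [E(X)]^2 = \frac{22}{6} + \frac{7}{3} - \left(\frac{7}{3}\right)^2 = \frac{5}{9}.$$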
note: Recall that the empirical distribution is defined by placing the weight (probability) of 1/n on
each of n observations $x_1, x_2, \ldots, x_n$. Then the mean of this empirical distribution is

$$\sum_{i=1}^{n} x_i \frac{1}{n} = \frac{\sum_{i=1}^{n} x_i}{n} = \bar{x}.$$

The symbol $\bar{x}$ represents the mean of the empirical distribution. It is seen that $\bar{x}$ is usually close in
value to $\mu = E(X)$; thus, when $\mu$ is unknown, $\bar{x}$ will be used to estimate $\mu$.

Similarly, the variance of the empirical distribution can be computed. Let v denote this variance
so that it is equal to

$$v = \sum_{i=1}^{n} (x_i - \bar{x})^2 \frac{1}{n} = \sum_{i=1}^{n} x_i^2 \frac{1}{n} - \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2.$$

This last statement is true because, in general,

$$\sigma^2 = E(X^2) - \mu^2.$$

note: There is a relationship between the sample variance $s^2$ and variance v of the empirical
distribution, namely $s^2 = nv/(n - 1)$. Of course, with large n, the difference between $s^2$ and v is
very small. Usually, we use $s^2$ to estimate $\sigma^2$ when $\sigma^2$ is unknown.
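The relationship between the empirical variance v and the sample variance $s^2$ can be illustrated numerically. The sketch below uses a made-up data set (the values in `data` are hypothetical, chosen only for illustration) and compares the result with Python's standard `statistics` module.

```python
import statistics
from fractions import Fraction

data = [2, 4, 4, 5, 7, 8]  # hypothetical observations, for illustration only
n = len(data)

xbar = Fraction(sum(data), n)                         # mean of the empirical distribution
v = Fraction(sum(x * x for x in data), n) - xbar ** 2  # v = (1/n) sum x_i^2 - xbar^2
s2 = n * v / (n - 1)                                  # sample variance s^2 = n v / (n - 1)

print(xbar, v, s2)
# statistics.pvariance divides by n; statistics.variance divides by n - 1.
print(statistics.pvariance(data), statistics.variance(data))
```

For these six values the empirical variance is 4 and the sample variance is 24/5, so the two differ noticeably; the gap shrinks as n grows.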



1.3.2 
1.3.2.1 



note: BERNOULLI TRIALS and BINOMIAL DISTRIBUTION (Section 1.4.1: BERNOULLI 
TRIALS AND THE BINOMIAL DISTRIBUTION) 



1.4 BERNOULLI TRIALS and the BINOMIAL DISTRIBUTION 4 

1.4.1 BERNOULLI TRIALS AND THE BINOMIAL DISTRIBUTION 

A Bernoulli experiment is a random experiment, the outcome of which can be classified in but one of
two mutually exclusive and exhaustive ways, namely, success or failure (e.g., female or male, life or death,
nondefective or defective).

A sequence of Bernoulli trials occurs when a Bernoulli experiment is performed several independent
times so that the probability of success, say, p, remains the same from trial to trial. That is, in such a
sequence we let p denote the probability of success on each trial. In addition, we frequently let q = 1 - p denote
the probability of failure; that is, we shall use q and 1 - p interchangeably.

1.4.1.1 Bernoulli distribution 

Let X be a random variable associated with a Bernoulli trial by defining it as follows:

X(success) = 1 and X(failure) = 0.

That is, the two outcomes, success and failure, are denoted by one and zero, respectively. The p.d.f.
of X can be written as

$$f(x) = p^x (1 - p)^{1 - x}, \quad x = 0, 1, \tag{1.10}$$



This content is available online at <http://cnx.Org/content/ml3123/l.3/>. 
and we say that X has a Bernoulli distribution. The expected value of X is

$$\mu = E(X) = \sum_{x=0}^{1} x\, p^x (1 - p)^{1 - x} = (0)(1 - p) + (1)(p) = p, \tag{1.11}$$

and the variance of X is

$$\sigma^2 = \mathrm{Var}(X) = \sum_{x=0}^{1} (x - p)^2 p^x (1 - p)^{1 - x} = p^2 (1 - p) + (1 - p)^2 p = p(1 - p) = pq. \tag{1.12}$$

It follows that the standard deviation of X is $\sigma = \sqrt{p(1 - p)} = \sqrt{pq}$.

In a sequence of n Bernoulli trials, we shall let $X_i$ denote the Bernoulli random variable associated with
the ith trial. An observed sequence of n Bernoulli trials will then be an n-tuple of zeros and ones.

1.4.1.1.1 

Binomial Distribution 

In a sequence of Bernoulli trials we are often interested in the total number of successes and not in the
order of their occurrence. If we let the random variable X equal the number of observed successes in n
Bernoulli trials, the possible values of X are 0, 1, 2, ..., n. If x successes occur, where x = 0, 1, 2, ..., n, then n - x
failures occur. The number of ways of selecting x positions for the x successes in the n trials is

$$\binom{n}{x} = \frac{n!}{x!\,(n - x)!}.$$

Since the trials are independent and since the probabilities of success and failure on each trial are, respectively,
p and q = 1 - p, the probability of each of these ways is $p^x (1 - p)^{n - x}$. Thus the p.d.f. of X, say f(x), is
the sum of the probabilities of these $\binom{n}{x}$ mutually exclusive events; that is,

$$f(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, 2, \ldots, n.$$

These probabilities are called binomial probabilities, and the random variable X is said to have a binomial
distribution.

Summarizing, a binomial experiment satisfies the following properties: 

1. A Bernoulli (success-failure) experiment is performed n times. 

2. The trials are independent. 

3. The probability of success on each trial is a constant p; the probability of failure is q = 1 — p . 

4. The random variable X counts the number of successes in the n trials. 

A binomial distribution will be denoted by the symbol b(n, p), and we say that the distribution of X is b(n, p).
The constants n and p are called the parameters of the binomial distribution; they correspond to
the number n of independent trials and the probability p of success on each trial. Thus, if we say that the
distribution of X is b(12, 1/4), we mean that X is the number of successes in n = 12 Bernoulli trials with
probability p = 1/4 of success on each trial.

Example 1.14 

In an instant lottery with 20% winning tickets, if X is equal to the number of winning tickets
among n = 8 that are purchased, the probability of purchasing 2 winning tickets is

$$f(2) = P(X = 2) = \binom{8}{2} (0.2)^2 (0.8)^6 = 0.2936.$$

The distribution of the random variable X is b(8, 0.2).
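The binomial p.d.f. is straightforward to evaluate with the standard library. The following is a minimal sketch; the function name `binom_pmf` is an illustrative choice.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X distributed b(n, p): C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Lottery example: n = 8 tickets, each winning with probability 0.2.
print(round(binom_pmf(2, 8, 0.2), 4))

# The probabilities over x = 0, ..., n add to 1 (up to floating-point rounding).
print(sum(binom_pmf(x, 8, 0.2) for x in range(9)))
```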

Example 1.15 

Leghorn chickens are raised for laying eggs. If p = 0.5 is the probability of a female chick hatching,
assuming independence, the probability that there are exactly 6 females out of 10 newly hatched
chicks selected at random is

$$\binom{10}{6} \left(\frac{1}{2}\right)^6 \left(\frac{1}{2}\right)^4 = P(X \le 6) - P(X \le 5) = 0.8281 - 0.6230 = 0.2051.$$

Since

$$P(X \le 6) = 0.8281$$

and

$$P(X \le 5) = 0.6230,$$

which are tabulated values, the probability of at least 6 female chicks is

$$\sum_{x=6}^{10} \binom{10}{x} \left(\frac{1}{2}\right)^x \left(\frac{1}{2}\right)^{10 - x} = 1 - P(X \le 5) = 1 - 0.6230 = 0.3770.$$

Example 1.16 

Suppose that we are in those rare times when 65% of the American public approve of the way the
President of the United States is handling his job. Take a random sample of n = 8 Americans and
let Y equal the number who give approval. Then the distribution of Y is b(8, 0.65). To find

$$P(Y \ge 6)$$

note that

$$P(Y \ge 6) = P(8 - Y \le 8 - 6) = P(X \le 2),$$

where

$$X = 8 - Y$$

counts the number who disapprove. Since q = 1 - p = 0.35 equals the probability of disapproval by
each person selected, the distribution of X is b(8, 0.35). From the tables, since

$$P(X \le 2) = 0.4278,$$

it follows that

$$P(Y \ge 6) = 0.4278.$$

Similarly,

$$P(Y \le 5) = P(8 - Y \ge 8 - 5) = P(X \ge 3) = 1 - P(X \le 2) = 1 - 0.4278 = 0.5722$$

and

$$P(Y = 5) = P(8 - Y = 8 - 5) = P(X = 3) = P(X \le 3) - P(X \le 2) = 0.7064 - 0.4278 = 0.2786.$$
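The tabulated values used in Examples 1.15 and 1.16 can be reproduced by summing the binomial p.d.f. The sketch below is self-contained; `binom_pmf` and `binom_cdf` are illustrative helper names, not functions from the text.

```python
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(k, n, p):
    """P(X <= k) for X distributed b(n, p)."""
    return sum(binom_pmf(x, n, p) for x in range(k + 1))

# Example 1.15: b(10, 0.5)
exactly_6 = binom_cdf(6, 10, 0.5) - binom_cdf(5, 10, 0.5)
at_least_6 = 1 - binom_cdf(5, 10, 0.5)
print(round(exactly_6, 4), round(at_least_6, 4))

# Example 1.16: Y is b(8, 0.65), and X = 8 - Y is b(8, 0.35).
direct = sum(binom_pmf(y, 8, 0.65) for y in range(6, 9))  # P(Y >= 6)
via_x = binom_cdf(2, 8, 0.35)                             # P(X <= 2)
print(round(direct, 4), round(via_x, 4))
```

Counting failures instead of successes is only a convenience for reading tables that stop at p = 0.5; with direct computation either route gives the same probability.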



note: if n is a positive integer, then

$$(a + b)^n = \sum_{x=0}^{n} \binom{n}{x} b^x a^{n - x}.$$

Thus the sum of the binomial probabilities, if we use the above binomial expansion with b = p and
a = 1 - p, is

$$\sum_{x=0}^{n} \binom{n}{x} p^x (1 - p)^{n - x} = [(1 - p) + p]^n = 1,$$

1.4.1.1.1.1 

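a result that had to follow from the fact that f(x) is a p.d.f. We use the binomial expansion to find the
mean and the variance of the binomial random variable X that is b(n, p). The mean is given by

$$\mu = E(X) = \sum_{x=0}^{n} x \frac{n!}{x!\,(n - x)!} p^x (1 - p)^{n - x}.$$

Since the first term of this sum is equal to zero, this can be written as

$$\mu = \sum_{x=1}^{n} \frac{n!}{(x - 1)!\,(n - x)!} p^x (1 - p)^{n - x},$$

because x/x! = 1/(x - 1)! when x > 0. Letting k = x - 1, we obtain

$$\mu = np \sum_{k=0}^{n-1} \frac{(n - 1)!}{k!\,(n - 1 - k)!} p^k (1 - p)^{n - 1 - k} = np\,[(1 - p) + p]^{n - 1} = np.$$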



1.4.1.1.1.2 

To find the variance, we first determine the second factorial moment E[X(X - 1)]:

$$E[X(X - 1)] = \sum_{x=0}^{n} x(x - 1) \frac{n!}{x!\,(n - x)!} p^x (1 - p)^{n - x}. \tag{1.15}$$

The first two terms in this summation equal zero; thus we find that

$$E[X(X - 1)] = \sum_{x=2}^{n} \frac{n!}{(x - 2)!\,(n - x)!} p^x (1 - p)^{n - x},$$

after observing that x(x - 1)/x! = 1/(x - 2)! when x > 1. Letting k = x - 2, we obtain

$$E[X(X - 1)] = \sum_{k=0}^{n-2} \frac{n!}{k!\,(n - k - 2)!} p^{k+2} (1 - p)^{n - k - 2} = n(n - 1)p^2 \sum_{k=0}^{n-2} \frac{(n - 2)!}{k!\,(n - 2 - k)!} p^k (1 - p)^{n - 2 - k}.$$

Since the last summand is that of the binomial p.d.f. b(n - 2, p), the sum equals 1 and we obtain

$$E[X(X - 1)] = n(n - 1)p^2.$$

Thus,

$$\sigma^2 = \mathrm{Var}(X) = E(X^2) - [E(X)]^2 = E[X(X - 1)] + E(X) - [E(X)]^2$$
$$= n(n - 1)p^2 + np - (np)^2 = -np^2 + np = np(1 - p).$$
1.4.1.1.1.3

Summarizing,

if X is b(n, p), we obtain

$$\mu = np, \quad \sigma^2 = np(1 - p) = npq, \quad \sigma = \sqrt{np(1 - p)}.$$

note: When p is the probability of success on each trial, the expected number of successes in n 
trials is np, a result that agrees with most of our intuitions. 
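The closed forms for the mean, the second factorial moment, and the variance can be compared against direct summation over the p.d.f. This is a small numerical sketch; the parameter values n = 8 and p = 0.2 are arbitrary illustrative choices.

```python
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 8, 0.2  # illustrative parameter values
xs = range(n + 1)

mean = sum(x * binom_pmf(x, n, p) for x in xs)             # E(X)
fact2 = sum(x * (x - 1) * binom_pmf(x, n, p) for x in xs)  # E[X(X - 1)]
var = fact2 + mean - mean ** 2                             # variance via factorial moment

print(mean, n * p)                  # both approximately 1.6
print(fact2, n * (n - 1) * p ** 2)  # both approximately 2.24
print(var, n * p * (1 - p))         # both approximately 1.28
```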



1.5 GEOMETRIC DISTRIBUTION 5 

1.5.1 GEOMETRIC DISTRIBUTION 

To obtain a binomial random variable, we observed a sequence of n Bernoulli trials and counted the number 
of successes. Suppose now that we do not fix the number of Bernoulli trials in advance but instead continue 
to observe the sequence of Bernoulli trials until a certain number, r, of successes occurs. The random
variable of interest is the number of trials needed to observe the rth success. 

1.5.1.1 

Let us first discuss the problem when r = 1. That is, consider a sequence of Bernoulli trials with probability p
of success. This sequence is observed until the first success occurs. Let X denote the trial number on which
the first success occurs.

For example, if F and S represent failure and success, respectively, and the sequence starts with
F, F, F, S, ..., then X = 4. Moreover, because the trials are independent, the probability of such a sequence
is

$$P(X = 4) = (q)(q)(q)(p) = q^3 p = (1 - p)^3 p.$$

In general, the p.d.f. f(x) = P(X = x) of X is given by $f(x) = (1 - p)^{x - 1} p$, x = 1, 2, ..., because
there must be x - 1 failures before the first success that occurs on trial x. We say that X has a geometric
distribution.

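note: for a geometric series, the sum is given by

$$\sum_{k=0}^{\infty} a r^k = \sum_{k=1}^{\infty} a r^{k - 1} = \frac{a}{1 - r}$$

when |r| < 1.
Thus,

$$\sum_{x=1}^{\infty} f(x) = \sum_{x=1}^{\infty} (1 - p)^{x - 1} p = \frac{p}{1 - (1 - p)} = 1,$$

so that f(x) does satisfy the properties of a p.d.f.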

From the sum of a geometric series we also note that, when k is an integer,

$$P(X > k) = \sum_{x=k+1}^{\infty} (1 - p)^{x - 1} p = \frac{(1 - p)^k p}{1 - (1 - p)} = (1 - p)^k = q^k,$$



5 This content is available online at <http://cnx.Org/content/ml3124/l.3/>. 
and thus the value of the distribution function at a positive integer k is

$$P(X \le k) = \sum_{x=1}^{k} (1 - p)^{x - 1} p = 1 - P(X > k) = 1 - (1 - p)^k = 1 - q^k.$$

Example 1.17 

Some biology students were checking the eye color for a large number of fruit flies. For the
individual fly, suppose that the probability of white eyes is 1/4 and the probability of red eyes is 3/4,
and that we may treat these flies as independent Bernoulli trials. The probability that at least
four flies have to be checked for eye color to observe a white-eyed fly is given by

$$P(X \ge 4) = P(X > 3) = q^3 = \left(\frac{3}{4}\right)^3 = 0.422.$$

The probability that at most four flies have to be checked for eye color to observe a white-eyed
fly is given by

$$P(X \le 4) = 1 - q^4 = 1 - \left(\frac{3}{4}\right)^4 = 0.684.$$

The probability that the first fly with white eyes is the fourth fly that is checked is

$$P(X = 4) = q^{4 - 1} p = \left(\frac{3}{4}\right)^3 \left(\frac{1}{4}\right) = 0.105.$$

It is also true that

$$P(X = 4) = P(X \le 4) - P(X \le 3) = \left[1 - \left(\frac{3}{4}\right)^4\right] - \left[1 - \left(\frac{3}{4}\right)^3\right] = \left(\frac{3}{4}\right)^3 \left(\frac{1}{4}\right).$$

In general,

$$f(x) = P(X = x) = \left(\frac{3}{4}\right)^{x - 1} \left(\frac{1}{4}\right), \quad x = 1, 2, 3, \ldots$$
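The three probabilities in this example follow directly from the geometric p.d.f. and distribution function. Here is a minimal sketch using exact fractions; `geom_pmf` and `geom_cdf` are illustrative names.

```python
from fractions import Fraction

p = Fraction(1, 4)  # probability of a white-eyed fly
q = 1 - p           # probability of a red-eyed fly

def geom_pmf(x):
    """P(X = x): x - 1 failures followed by the first success."""
    return q ** (x - 1) * p

def geom_cdf(k):
    """P(X <= k) = 1 - q^k for a positive integer k."""
    return 1 - q ** k

at_least_4 = 1 - geom_cdf(3)  # P(X >= 4) = P(X > 3) = q^3
at_most_4 = geom_cdf(4)       # P(X <= 4)
exactly_4 = geom_pmf(4)       # P(X = 4)

print(float(at_least_4), float(at_most_4), float(exactly_4))
```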



1.5.1.2 

To find the mean and variance for the geometric distribution, let us use the following results about the sum and
the first and second derivatives of a geometric series. For -1 < r < 1, let

$$g(r) = \sum_{k=0}^{\infty} a r^k = \frac{a}{1 - r}.$$

Then

$$g'(r) = \sum_{k=1}^{\infty} a k r^{k - 1} = \frac{a}{(1 - r)^2},$$

and

$$g''(r) = \sum_{k=2}^{\infty} a k (k - 1) r^{k - 2} = \frac{2a}{(1 - r)^3}.$$
If X has a geometric distribution and 0 < p < 1, then the mean of X is given by

$$E(X) = \sum_{x=1}^{\infty} x q^{x - 1} p = \frac{p}{(1 - q)^2} = \frac{1}{p}, \tag{1.16}$$

using the formula for g'(r) with a = p and r = q.

note: for example, if p = 1/4 is the probability of success, then

E(X) = 1/(1/4) = 4
trials are needed on the average to observe a success.

1.5.1.3 

To find the variance of X, let us first find the second factorial moment E[X(X - 1)]. We have

$$E[X(X - 1)] = \sum_{x=1}^{\infty} x(x - 1) q^{x - 1} p = \sum_{x=2}^{\infty} pq\, x(x - 1) q^{x - 2} = \frac{2pq}{(1 - q)^3} = \frac{2q}{p^2},$$

using the formula for g''(r) with a = pq and r = q. Thus the variance of X is

$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \{E[X(X - 1)] + E(X)\} - [E(X)]^2$$
$$= \frac{2q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{2q + p - 1}{p^2} = \frac{1 - p}{p^2}.$$

The standard deviation of X is

$$\sigma = \sqrt{(1 - p)/p^2}.$$



Example 1.18 

Continuing with Example 1.17, with p = 1/4, we obtain

$$\sigma^2 = \frac{3/4}{(1/4)^2} = 12$$

and

$$\sigma = \sqrt{12} = 3.464.$$
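The mean 1/p and variance (1 - p)/p^2 can be checked numerically by truncating the infinite sums. The sketch below assumes p = 1/4 as in the example; the cut-off `N = 2000` is an arbitrary choice, large enough that the neglected tail is far below floating-point precision.

```python
from math import sqrt

p = 0.25
q = 1 - p
N = 2000  # truncation point (an assumption of this sketch); q**N is negligible

mean = sum(x * q ** (x - 1) * p for x in range(1, N + 1))             # E(X)
fact2 = sum(x * (x - 1) * q ** (x - 1) * p for x in range(1, N + 1))  # E[X(X - 1)]
var = fact2 + mean - mean ** 2
sigma = sqrt(var)

print(mean, 1 / p)             # both approximately 4
print(fact2, 2 * q / p ** 2)   # both approximately 24
print(var, (1 - p) / p ** 2)   # both approximately 12
print(round(sigma, 3))
```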

note: Binomial Distribution (Section 1.4.1.1.1) 

note: Poisson Distribution (Section 1.6.1: POISSON DISTRIBUTION) 

1.6 POISSON DISTRIBUTION 6 
1.6.1 POISSON DISTRIBUTION 

Some experiments result in counting the number of times particular events occur in given times or on given
physical objects. For example, we could count the number of phone calls arriving at a switchboard between
9 and 10 am, the number of flaws in 100 feet of wire, the number of customers that arrive at a ticket window
between 12 noon and 2 pm, or the number of defects in a 100-foot roll of aluminum screen that is 2 feet


6 This content is available online at <http://cnx.Org/content/ml3125/l.3/>. 
wide. Each count can be looked upon as a random variable associated with an approximate Poisson process
provided the conditions in the definition below are satisfied.

Definition 1.5: POISSON PROCESS

Let the number of changes that occur in a given continuous interval be counted. We have an
approximate Poisson process with parameter $\lambda > 0$ if the following are satisfied:

1. The numbers of changes occurring in nonoverlapping intervals are independent.

2. The probability of exactly one change in a sufficiently short interval of length h is approximately $\lambda h$.

3. The probability of two or more changes in a sufficiently short interval is essentially zero.



1.6.1.1 

Suppose that an experiment satisfies the three points of an approximate Poisson process. Let X denote
the number of changes in an interval of "length 1" (where "length 1" represents one unit of the quantity
under consideration). We would like to find an approximation for P(X = x), where x is a nonnegative
integer. To achieve this, we partition the unit interval into n subintervals of equal length 1/n. If n is
sufficiently large (i.e., much larger than x), one shall approximate the probability that x changes occur in
this unit interval by finding the probability that exactly one change occurs in each of exactly x of these n
subintervals. The probability of one change occurring in any one subinterval of length 1/n is approximately
λ(1/n) by condition (2). The probability of two or more changes in any one subinterval is essentially zero
by condition (3). So for each subinterval, exactly one change occurs with a probability of approximately
λ(1/n). Consider the occurrence or nonoccurrence of a change in each subinterval as a Bernoulli trial. By
condition (1) we have a sequence of n Bernoulli trials with probability p approximately equal to λ(1/n).
Thus an approximation for P(X = x) is given by the binomial probability

$$\frac{n!}{x!\,(n-x)!}\left(\frac{\lambda}{n}\right)^{x}\left(1-\frac{\lambda}{n}\right)^{n-x}.$$

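In order to obtain a better approximation, choose a large value for n. If n increases without bound, we
have that

$$\lim_{n\to\infty}\frac{n!}{x!\,(n-x)!}\left(\frac{\lambda}{n}\right)^{x}\left(1-\frac{\lambda}{n}\right)^{n-x} = \lim_{n\to\infty}\frac{n(n-1)\cdots(n-x+1)}{n^{x}}\,\frac{\lambda^{x}}{x!}\left(1-\frac{\lambda}{n}\right)^{n}\left(1-\frac{\lambda}{n}\right)^{-x}.$$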

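Now, for fixed x, we have

$$\lim_{n\to\infty}\frac{n(n-1)\cdots(n-x+1)}{n^{x}} = \lim_{n\to\infty}\left[1\left(1-\frac{1}{n}\right)\cdots\left(1-\frac{x-1}{n}\right)\right] = 1,$$

$$\lim_{n\to\infty}\left(1-\frac{\lambda}{n}\right)^{n} = e^{-\lambda},$$

and

$$\lim_{n\to\infty}\left(1-\frac{\lambda}{n}\right)^{-x} = 1.$$

Thus,

$$\lim_{n\to\infty}\frac{n!}{x!\,(n-x)!}\left(\frac{\lambda}{n}\right)^{x}\left(1-\frac{\lambda}{n}\right)^{n-x} = \frac{\lambda^{x}e^{-\lambda}}{x!} = P(X = x),$$

approximately. The distribution of probability associated with this process has a special name.

1.6.1.1.1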

Definition 1.6: POISSON DISTRIBUTION 

We say that the random variable X has a Poisson distribution if its p.d.f. is of the form

$$f(x) = \frac{\lambda^{x}e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots,$$

where λ > 0.

It is easy to see that f(x) enjoys the properties of a p.d.f. because clearly f(x) ≥ 0 and, from the
Maclaurin series expansion of $e^{\lambda}$, we have

$$\sum_{x=0}^{\infty}\frac{\lambda^{x}e^{-\lambda}}{x!} = e^{-\lambda}\sum_{x=0}^{\infty}\frac{\lambda^{x}}{x!} = e^{-\lambda}e^{\lambda} = 1.$$
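As a numerical illustration of the limiting argument and of the normalization just shown, the following minimal Python sketch compares the binomial probability with p = λ/n against the Poisson probability as n grows. The values λ = 2 and x = 3 are illustrative assumptions, not taken from the text, and the function names are hypothetical.

```python
from math import comb, exp, factorial

def poisson_pmf(x, lam):
    """Poisson probability lam**x * exp(-lam) / x!."""
    return lam**x * exp(-lam) / factorial(x)

def binomial_pmf(x, n, p):
    """Binomial probability C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

lam, x = 2.0, 3  # illustrative values (assumption)
for n in (10, 100, 1000, 10000):
    # The binomial value with p = lam / n approaches the Poisson value.
    print(n, binomial_pmf(x, n, lam / n), poisson_pmf(x, lam))

# Truncated check that the Poisson probabilities sum to 1;
# for lam = 2 the terms beyond x = 60 are negligible.
total = sum(poisson_pmf(k, lam) for k in range(60))
print(total)
```

The printed binomial values should move steadily toward the Poisson value as n increases, which is what the limit above predicts.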



To discover the exact role of the parameter λ > 0, let us find some of the characteristics of the Poisson
distribution. The mean for the Poisson distribution is given by

$$E(X) = \sum_{x=0}^{\infty} x\,\frac{\lambda^{x}e^{-\lambda}}{x!} = e^{-\lambda}\sum_{x=1}^{\infty}\frac{\lambda^{x}}{(x-1)!},$$

because (0)f(0) = 0 and x/x! = 1/(x - 1)! when x > 0.
If we let k = x - 1, then

$$E(X) = e^{-\lambda}\sum_{k=0}^{\infty}\frac{\lambda^{k+1}}{k!} = \lambda e^{-\lambda}\sum_{k=0}^{\infty}\frac{\lambda^{k}}{k!} = \lambda e^{-\lambda}e^{\lambda} = \lambda.$$

That is, the parameter λ is the mean of the Poisson distribution. Figure 1 (Figure 1.2:
Poisson Distribution) shows the p.d.f. and c.d.f. of the Poisson distribution for λ = 1, λ = 4, λ = 10.



Figure 1.2: The p.d.f. and c.d.f. of the Poisson distribution for λ = 1, λ = 4, λ = 10. (a) The p.d.f.,
(b) The c.d.f.



To find the variance, we first determine the second factorial moment E[X(X - 1)]. We have

$$E[X(X-1)] = \sum_{x=0}^{\infty} x(x-1)\,\frac{\lambda^{x}e^{-\lambda}}{x!} = e^{-\lambda}\sum_{x=2}^{\infty}\frac{\lambda^{x}}{(x-2)!},$$

because (0)(0 - 1)f(0) = 0, (1)(1 - 1)f(1) = 0, and x(x - 1)/x! = 1/(x - 2)! when x > 1.
If we let k = x - 2, then

$$E[X(X-1)] = e^{-\lambda}\sum_{k=0}^{\infty}\frac{\lambda^{k+2}}{k!} = \lambda^{2}e^{-\lambda}\sum_{k=0}^{\infty}\frac{\lambda^{k}}{k!} = \lambda^{2}e^{-\lambda}e^{\lambda} = \lambda^{2}.$$

Thus,

$$\mathrm{Var}(X) = E(X^{2}) - [E(X)]^{2} = E[X(X-1)] + E(X) - [E(X)]^{2} = \lambda^{2} + \lambda - \lambda^{2} = \lambda.$$

That is, for the Poisson distribution, μ = σ² = λ.
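The mean and variance derived above can be checked numerically by truncating the infinite sums. The sketch below is only a sanity check under the assumption that 150 terms are enough for the values of λ tried; the helper names are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Poisson probability lam**x * exp(-lam) / x!."""
    return lam**x * exp(-lam) / factorial(x)

def poisson_moments(lam, terms=150):
    """Return (mean, variance) from sums truncated at x = terms - 1."""
    mean = sum(x * poisson_pmf(x, lam) for x in range(terms))
    # Second factorial moment E[X(X - 1)], as in the derivation above.
    second_factorial = sum(x * (x - 1) * poisson_pmf(x, lam) for x in range(terms))
    variance = second_factorial + mean - mean**2
    return mean, variance

for lam in (1.0, 4.0, 10.0):
    print(lam, poisson_moments(lam))
```

For each λ both returned values should agree with λ up to rounding error.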

1.6.1.1.2 

Example 1.19 

Let X have a Poisson distribution with a mean of λ = 5 (it is possible to use the tabularized
Poisson distribution). Then

P(X > 5) = 1 - P(X ≤ 5) = 1 - 0.616 = 0.384,

and

P(X = 6) = P(X ≤ 6) - P(X ≤ 5) = 0.762 - 0.616 = 0.146.

Example 1.20 

Telephone calls enter a college switchboard on the average of two every 3 minutes. If one assumes
an approximate Poisson process, what is the probability of five or more calls arriving in a 9-minute
period? Let X denote the number of calls in a 9-minute period. We see that E(X) = 6; that is,
on the average, six calls will arrive during a 9-minute period. Thus, using tabularized data,

$$P(X \ge 5) = 1 - P(X \le 4) = 1 - \sum_{x=0}^{4}\frac{6^{x}e^{-6}}{x!} = 1 - 0.285 = 0.715.$$
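The table lookups in Examples 1.19 and 1.20 can be reproduced directly from the Poisson p.d.f. The following sketch sums the probabilities instead of using a table; `poisson_cdf` is a hypothetical helper name introduced here for illustration.

```python
from math import exp, factorial

def poisson_cdf(x, lam):
    """P(X <= x) for a Poisson random variable with mean lam."""
    return sum(lam**k * exp(-lam) / factorial(k) for k in range(x + 1))

# Example 1.19 (mean 5)
print(round(1 - poisson_cdf(5, 5), 3))                   # P(X > 5)  -> 0.384
print(round(poisson_cdf(6, 5) - poisson_cdf(5, 5), 3))   # P(X = 6)  -> 0.146

# Example 1.20 (mean 6 calls per 9 minutes)
print(round(1 - poisson_cdf(4, 6), 3))                   # P(X >= 5) -> 0.715
```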



1.6.1.1.3 

note: Not only is the Poisson distribution important in its own right, but it can also be used to
approximate probabilities for a binomial distribution.

If X has a Poisson distribution with parameter λ, we saw that with n large,

$$P(X = x) \approx \binom{n}{x}\left(\frac{\lambda}{n}\right)^{x}\left(1-\frac{\lambda}{n}\right)^{n-x},$$

where p = λ/n, so that λ = np in the above binomial probability. That is, if X has the binomial distribution
b(n, p) with large n, then

$$\frac{(np)^{x}e^{-np}}{x!} \approx \binom{n}{x}p^{x}(1-p)^{n-x}.$$

This approximation is reasonably good if n is large. But since λ was a fixed constant in that earlier
argument, p should be small since np = λ. In particular, the approximation is quite accurate if n ≥ 20 and
p ≤ 0.05, and it is very good if n ≥ 100 and np ≤ 10.

1.6.1.1.4 

Example 1.21 

A manufacturer of Christmas tree bulbs knows that 2% of its bulbs are defective. Approximate
the probability that a box of 100 of these bulbs contains at most three defective bulbs. Assuming
independence, we have a binomial distribution with parameters p = 0.02 and n = 100. The Poisson
distribution with λ = 100(0.02) = 2 gives

$$\sum_{x=0}^{3}\frac{2^{x}e^{-2}}{x!} = 0.857;$$

using the binomial distribution, we obtain, after some tedious calculations,

$$\sum_{x=0}^{3}\binom{100}{x}(0.02)^{x}(0.98)^{100-x} = 0.859.$$

Hence, in this case, the Poisson approximation is extremely close to the true value, but much
easier to find.
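The "tedious calculations" of Example 1.21 are easy to delegate to a computer. This short sketch evaluates both sums; the variable names are chosen here for illustration only.

```python
from math import comb, exp, factorial

n, p = 100, 0.02
lam = n * p  # Poisson parameter, lambda = np = 2

# P(at most three defective bulbs) under each model.
poisson = sum(lam**x * exp(-lam) / factorial(x) for x in range(4))
binomial = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(4))

print(round(poisson, 3), round(binomial, 3))  # 0.857 0.859
```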



Chapter 2 

Continuous Distributions 

2.1 CONTINUOUS DISTRIBUTION 1 
2.1.1 CONTINUOUS DISTRIBUTION 

2.1.1.1 RANDOM VARIABLES OF THE CONTINUOUS TYPE 

Random variables whose spaces are not composed of a countable number of points but are intervals or a
union of intervals are said to be of the continuous type. Recall that the relative frequency histogram h(x)
associated with n observations of a random variable of that type is a nonnegative function defined so that
the total area between its graph and the x axis equals one. In addition, h(x) is constructed so that the
integral

$$\int_{a}^{b} h(x)\,dx \qquad (2.1)$$

is an estimate of the probability P(a < X < b), where the interval (a, b) is a subset of the space R of the
random variable X.

Let us now consider what happens to the function h(x) in the limit, as n increases without bound and as
the lengths of the class intervals decrease to zero. It is to be hoped that h(x) will become closer and closer
to some function, say f(x), that gives the true probabilities, such as P(a < X < b), through the integral

$$P(a < X < b) = \int_{a}^{b} f(x)\,dx. \qquad (2.2)$$
Definition 2.1: PROBABILITY DENSITY FUNCTION 

1. The function f(x) is a nonnegative function such that the total area between its graph and the x axis
equals one.

2. The probability P(a < X < b) is the area bounded by the graph of f(x), the x axis, and the
lines x = a and x = b.

3. We say that the probability density function (p.d.f.) of the random variable X of the
continuous type, with space R that is an interval or union of intervals, is an integrable function
f(x) satisfying the following conditions:

• f(x) > 0, x belongs to R,

• $\int_{R} f(x)\,dx = 1$,

• The probability of the event X ∈ A, where A is a subset of R, is $P(X \in A) = \int_{A} f(x)\,dx$.

1 This content is available online at <http://cnx.Org/content/ml3127/l.4/>.

Example 2.1 

Let the random variable X be the distance in feet between bad records on a used computer tape.
Suppose that a reasonable probability model for X is given by the p.d.f.

$$f(x) = \frac{1}{40}e^{-x/40}, \quad 0 \le x < \infty.$$

note: R = (x : 0 ≤ x < ∞) and f(x) > 0 for x belonging to R,

$$\int_{R} f(x)\,dx = \int_{0}^{\infty}\frac{1}{40}e^{-x/40}\,dx = \lim_{b\to\infty}\left[-e^{-x/40}\right]_{0}^{b} = 1 - \lim_{b\to\infty}e^{-b/40} = 1.$$

The probability that the distance between bad records is greater than 40 feet is

$$P(X > 40) = \int_{40}^{\infty}\frac{1}{40}e^{-x/40}\,dx = e^{-1} = 0.368.$$

The p.d.f. and the probability of interest are depicted in Figure 1 (Figure 2.1).

Figure 2.1: The p.d.f. and the probability of interest.



To avoid repeated references to the space R of the random variable X, we shall adopt the same
convention when describing probability density functions of the continuous type as was used in the discrete case.

Let us extend the definition of the p.d.f. f(x) to the entire set of real numbers by letting it equal zero when
x does not belong to R. For example,

$$f(x) = \begin{cases} \frac{1}{40}e^{-x/40}, & 0 \le x < \infty, \\ 0, & \text{elsewhere}, \end{cases}$$

has the properties of a p.d.f. of a continuous-type random variable X having support (x : 0 ≤ x < ∞).
It will always be understood that f(x) = 0 when x does not belong to R, even when this is not explicitly written
out.
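The integrals of Example 2.1 can also be approximated numerically. The sketch below uses a simple midpoint rule and, as an assumption of this illustration, replaces the infinite upper limit by 2000, beyond which the density is negligible. The p.d.f. is coded with the "zero elsewhere" convention just described; `integrate` is a hypothetical helper.

```python
from math import exp

def f(x):
    """p.d.f. of Example 2.1, extended to be zero outside its support."""
    return exp(-x / 40) / 40 if x >= 0 else 0.0

def integrate(func, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

total = integrate(f, 0, 2000)   # should be close to 1
tail = integrate(f, 40, 2000)   # P(X > 40), should be close to exp(-1)
print(round(total, 3), round(tail, 3), round(exp(-1), 3))
```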



2.1.1.2 



Definition 2.2: DISTRIBUTION FUNCTION

1. The distribution function of a random variable X of the continuous type is defined in terms of
the p.d.f. of X, and is given by

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt.$$

2. By the fundamental theorem of calculus we have, for x values for which the derivative F'(x)
exists, that F'(x) = f(x).

Example 2.2 

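Continuing with Example 1 (Example 2.1), if the p.d.f. of X is

$$f(x) = \begin{cases} 0, & -\infty < x < 0, \\ \frac{1}{40}e^{-x/40}, & 0 \le x < \infty, \end{cases}$$

the distribution function of X is F(x) = 0 for x ≤ 0, and for x > 0

$$F(x) = \int_{-\infty}^{x} f(t)\,dt = \int_{0}^{x}\frac{1}{40}e^{-t/40}\,dt = \left[-e^{-t/40}\right]_{0}^{x} = 1 - e^{-x/40}.$$

NOTE:

$$F'(x) = \begin{cases} 0, & -\infty < x < 0, \\ \frac{1}{40}e^{-x/40}, & 0 < x < \infty. \end{cases}$$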

Also F'(0) does not exist. Since there are no steps or jumps in a distribution function F(x) of the
continuous type, it must be true that

P(X = b) = 0

for all real values of b. This agrees with the fact that the integral

$$\int_{b}^{b} f(x)\,dx$$

is taken to be zero in calculus. Thus we see that

$$P(a \le X \le b) = P(a < X < b) = P(a \le X < b) = P(a < X \le b) = F(b) - F(a),$$

provided that X is a random variable of the continuous type. Moreover, we can change the definition of a
p.d.f. of a random variable of the continuous type at a finite (actually countable) number of points without
altering the distribution of probability.
For illustration,

$$f(x) = \begin{cases} 0, & -\infty < x < 0, \\ \frac{1}{40}e^{-x/40}, & 0 \le x < \infty, \end{cases}$$

and

$$f(x) = \begin{cases} 0, & -\infty < x \le 0, \\ \frac{1}{40}e^{-x/40}, & 0 < x < \infty, \end{cases}$$

are equivalent in the computation of probabilities involving this random variable.

Example 2.3

Let Y be a continuous random variable with the p.d.f. g(y) = 2y, 0 < y < 1. The distribution
function of Y is defined by

$$G(y) = \begin{cases} 0, & y < 0, \\ \int_{0}^{y} 2t\,dt = y^{2}, & 0 \le y < 1, \\ 1, & y \ge 1. \end{cases}$$

Figure 2 (Figure 2.2) gives the graph of the p.d.f. g(y) and the graph of the distribution function
G(y).

Figure 2.2: The p.d.f. and the probability of interest.



2.1.1.2.1 

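For illustration of computations of probabilities, consider

$$P\left(\frac{1}{2} < Y \le \frac{3}{4}\right) = G\left(\frac{3}{4}\right) - G\left(\frac{1}{2}\right) = \left(\frac{3}{4}\right)^{2} - \left(\frac{1}{2}\right)^{2} = \frac{5}{16}$$

and

$$P\left(\frac{1}{4} \le Y < 2\right) = G(2) - G\left(\frac{1}{4}\right) = 1 - \left(\frac{1}{4}\right)^{2} = \frac{15}{16}.$$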



note: The p.d.f. f(x) of a random variable of the discrete type is bounded by one because f(x)
gives a probability, namely f(x) = P(X = x).

For random variables of the continuous type, the p.d.f. does not have to be bounded. The restriction is that
the area between the p.d.f. and the x axis must equal one. Furthermore, it should be noted that the p.d.f.
of a random variable X of the continuous type does not need to be a continuous function.
For example,

$$f(x) = \begin{cases} \frac{1}{2}, & 0 < x < 1 \text{ or } 2 < x < 3, \\ 0, & \text{elsewhere}, \end{cases}$$

enjoys the properties of a p.d.f. of a distribution of the continuous type, and yet f(x) has discontinuities
at x = 0, 1, 2, and 3. However, the distribution function associated with a distribution of the continuous
type is always a continuous function. For continuous type random variables, the definitions associated with
mathematical expectation are the same as those in the discrete case except that integrals replace summations.

2.1.1.2.2 

For illustration, let X be a random variable with a p.d.f. f(x). The expected value of X, or
mean of X, is

$$\mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx.$$

The variance of X is

$$\sigma^{2} = \mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^{2} f(x)\,dx.$$

The standard deviation of X is

$$\sigma = \sqrt{\mathrm{Var}(X)}.$$
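These three definitions translate directly into numerical integration. As a rough sketch, the code below applies them to the p.d.f. g(y) = 2y, 0 < y < 1, of Example 2.3, using a midpoint rule; the helper `integrate` and the step count are assumptions of this illustration.

```python
def integrate(func, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

def g(y):
    """p.d.f. of Example 2.3: 2y on (0, 1), zero elsewhere."""
    return 2 * y if 0 < y < 1 else 0.0

mean = integrate(lambda y: y * g(y), 0, 1)                     # E(Y)
variance = integrate(lambda y: (y - mean) ** 2 * g(y), 0, 1)   # Var(Y)
std_dev = variance ** 0.5

print(mean, variance, std_dev)
```

The printed mean and variance should be close to 2/3 and 1/18, the exact values of the corresponding integrals.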



Example 2.4 

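For the random variable Y in Example 3 (Example 2.3),

$$\mu = E(Y) = \int_{0}^{1} y(2y)\,dy = \left[\frac{2}{3}y^{3}\right]_{0}^{1} = \frac{2}{3}$$

and

$$\sigma^{2} = \mathrm{Var}(Y) = E(Y^{2}) - \mu^{2} = \int_{0}^{1} y^{2}(2y)\,dy - \left(\frac{2}{3}\right)^{2} = \left[\frac{1}{2}y^{4}\right]_{0}^{1} - \frac{4}{9} = \frac{1}{18}.$$

2.1.2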

2.2 THE UNIFORM AND EXPONENTIAL DISTRIBUTIONS 2 

2.2.1 THE UNIFORM AND EXPONENTIAL DISTRIBUTIONS 

2.2.1.1 The Uniform Distribution 

Let the random variable X denote the outcome when a point is selected at random from the interval [a, b],
-∞ < a < b < ∞. If the experiment is performed in a fair manner, it is reasonable to assume that the
probability that the point is selected from the interval [a, x], a ≤ x < b, is (x - a)/(b - a). That is, the
probability is proportional to the length of the interval, so that the distribution function of X is

$$F(x) = \begin{cases} 0, & x < a, \\ \dfrac{x-a}{b-a}, & a \le x < b, \\ 1, & b \le x. \end{cases}$$

Because X is a continuous-type random variable, F'(x) is equal to the p.d.f. of X whenever F'(x) exists;
thus when a < x < b, we have

$$f(x) = F'(x) = \frac{1}{b-a}.$$

Definition 2.3: DEFINITION OF UNIFORM DISTRIBUTION 

The random variable X has a uniform distribution if its p.d.f. is equal to a constant on its
support. In particular, if the support is the interval [a, b], then

$$f(x) = \frac{1}{b-a}, \quad a \le x \le b. \qquad (2.3)$$



2.2.1.1.1 

Moreover, one shall say that X is U(a, b). This distribution is referred to as rectangular because the graph
of f(x) suggests that name. See Figure 1 (Figure 2.3) for the graph of f(x) and the distribution function
F(x).

2 This content is available online at <http://cnx.Org/content/ml3128/l.7/>.

Figure 2.3: The graph of the p.d.f. of the uniform distribution.

note: We could have taken f(a) = 0 or f(b) = 0 without altering the probabilities, since this is
a continuous type distribution, and it can be done in some cases.

The mean and variance of X are as follows:

$$\mu = \frac{a+b}{2}$$

and

$$\sigma^{2} = \frac{(b-a)^{2}}{12}.$$
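A quick simulation can be used to compare these formulas with sample values. The sketch below draws from U(a, b) with Python's standard generator, seeded so that a run is reproducible; the endpoints a = 2 and b = 7 and the sample size are illustrative assumptions, and the function names are hypothetical.

```python
import random

def uniform_moments(a, b):
    """Theoretical mean and variance of U(a, b)."""
    return (a + b) / 2, (b - a) ** 2 / 12

def simulate_uniform(a, b, n, seed=0):
    """Draw n values from U(a, b) using a seeded generator."""
    rng = random.Random(seed)
    return [a + (b - a) * rng.random() for _ in range(n)]

a, b = 2.0, 7.0  # illustrative endpoints (assumption)
sample = simulate_uniform(a, b, 100_000)
sample_mean = sum(sample) / len(sample)
sample_var = sum((x - sample_mean) ** 2 for x in sample) / len(sample)

print(uniform_moments(a, b), (sample_mean, sample_var))
```

Because the generator is seeded, repeating the call with the same seed returns exactly the same sequence of values.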



An important uniform distribution is that for which a = 0 and b = 1, namely U(0, 1). If X is U(0, 1),
approximate values of X can be simulated on most computers using a random number generator. In fact,
it should be called a pseudo-random number generator (see the pseudo-numbers generation (Section 5.3.1:
THE IVERSE PROBABILITY METHOD FOR GENERATING RANDOM VARIABLES)) because the
programs that produce the random numbers are usually such that if the starting number is known, all
subsequent numbers in the sequence may be determined by simple arithmetical operations.

2.2.1.2 An Exponential Distribution

Let us turn to the continuous distribution that is related to the Poisson distribution (Section 1.6.1: POISSON
DISTRIBUTION). When previously observing a process of the approximate Poisson type, we counted the
number of changes occurring in a given interval. This number was a discrete-type random variable with a
Poisson distribution. But not only is the number of changes a random variable; the waiting times between
successive changes are also random variables. However, the latter are of the continuous type, since each of
them can assume any positive value.

Let W denote the waiting time until the first change occurs when observing the Poisson process (Definition:
"POISSON PROCESS", p. 19) in which the mean number of changes in the unit interval is λ. Then
W is a continuous-type random variable, and we proceed to find its distribution function.

Because this waiting time is nonnegative, the distribution function F(w) = 0 for w < 0. For w ≥ 0,

$$F(w) = P(W \le w) = 1 - P(W > w) = 1 - P(\text{no changes in } [0, w]) = 1 - e^{-\lambda w},$$

since it was previously discovered that $e^{-\lambda w}$ equals the probability of no changes in an interval of
length w. That is, if the mean number of changes per unit interval is λ, then the mean number of changes in
an interval of length w is proportional to w, namely λw. Thus when w > 0, the p.d.f. of W is given by

$$F'(w) = \lambda e^{-\lambda w} = f(w).$$



Definition 2.4: DEFINITION OF EXPONENTIAL DISTRIBUTION 

Let λ = 1/θ; then the random variable X has an exponential distribution if its p.d.f. is
defined by

$$f(x) = \frac{1}{\theta}e^{-x/\theta}, \quad 0 \le x < \infty, \qquad (2.4)$$

where the parameter θ > 0.

2.2.1.3 

Accordingly, the waiting time W until the first change in a Poisson process has an exponential distribution
with θ = 1/λ. The mean and variance for the exponential distribution are as follows: μ = θ and σ² = θ².
So if λ is the mean number of changes in the unit interval, then

θ = 1/λ

is the mean waiting time for the first change. Suppose that λ = 7 is the mean number of changes per minute; then
the mean waiting time for the first change is 1/7 of a minute.

Figure 2.4: The graph of the p.d.f. of the exponential distribution.



Example 2.5 

Let X have an exponential distribution with a mean of 40. The p.d.f. of X is

$$f(x) = \frac{1}{40}e^{-x/40}, \quad 0 \le x < \infty.$$

The probability that X is less than 36 is

$$P(X < 36) = \int_{0}^{36}\frac{1}{40}e^{-x/40}\,dx = 1 - e^{-36/40} = 0.593.$$

Example 2.6 

Let X have an exponential distribution with mean μ = θ. Then the distribution function of X is

$$F(x) = \begin{cases} 0, & -\infty < x < 0, \\ 1 - e^{-x/\theta}, & 0 \le x < \infty. \end{cases}$$

The p.d.f. and distribution function are graphed in Figure 3 (Figure 2.5) for θ = 5.

Figure 2.5: The p.d.f. and c.d.f. graphs of the exponential distribution with θ = 5.



2.2.1.4 

note: For an exponential random variable X, we have that

$$P(X > x) = 1 - F(x) = 1 - \left(1 - e^{-x/\theta}\right) = e^{-x/\theta}.$$
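The distribution function and the survival probability P(X > x) are simple to code, and the link with waiting times can be explored by simulation. In the sketch below, `random.expovariate` takes the rate λ = 1/θ; the rate λ = 7 matches the waiting-time illustration above, while the seed and sample size are arbitrary assumptions and the function names are hypothetical.

```python
import random
from math import exp

def exponential_cdf(x, theta):
    """F(x) = 1 - exp(-x/theta) for x >= 0, and 0 for x < 0."""
    return 1 - exp(-x / theta) if x >= 0 else 0.0

def survival(x, theta):
    """P(X > x) = 1 - F(x)."""
    return 1 - exponential_cdf(x, theta)

print(round(exponential_cdf(36, 40), 3))  # Example 2.5 -> 0.593

# Simulated waiting times for a process with lam = 7 changes per minute.
rng = random.Random(0)
lam = 7.0
waits = [rng.expovariate(lam) for _ in range(100_000)]
mean_wait = sum(waits) / len(waits)
print(mean_wait, 1 / lam)  # the sample mean should be near theta = 1/7
```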



2.2.2 



2.3 THE GAMMA AND CHI-SQUARE DISTRIBUTIONS 3 
2.3.1 GAMMA AND CHI-SQUARE DISTRIBUTIONS 

In the (approximate) Poisson process (Definition: "POISSON PROCESS", p. 19) with mean λ, we have
seen that the waiting time until the first change has an exponential distribution (Section 2.2.1.2: An
Exponential Distribution). Let W now denote the waiting time until the αth change occurs, and let us find the
distribution of W. The distribution function of W, when w ≥ 0, is given by

$$F(w) = P(W \le w) = 1 - P(W > w) = 1 - P(\text{fewer than } \alpha \text{ changes occur in } [0, w]) = 1 - \sum_{k=0}^{\alpha-1}\frac{(\lambda w)^{k}e^{-\lambda w}}{k!},$$

since the number of changes in the interval [0, w] has a Poisson distribution with mean λw. Because W
is a continuous-type random variable, F'(w) is equal to the p.d.f. of W whenever this derivative exists. We
have, provided w > 0, that

$$F'(w) = \lambda e^{-\lambda w} - e^{-\lambda w}\sum_{k=1}^{\alpha-1}\left[\frac{k(\lambda w)^{k-1}\lambda}{k!} - \frac{(\lambda w)^{k}\lambda}{k!}\right] = \lambda e^{-\lambda w} - e^{-\lambda w}\left[\lambda - \frac{\lambda(\lambda w)^{\alpha-1}}{(\alpha-1)!}\right] = \frac{\lambda(\lambda w)^{\alpha-1}}{(\alpha-1)!}e^{-\lambda w}.$$

3 This content is available online at <http://cnx.Org/content/ml3129/l.3/>.

2.3.1.1 Gamma Distribution 

Definition 2.5: 

1. If w < 0, then F(w) = 0 and F'(w) = 0. A p.d.f. of this form is said to be one of the gamma
type, and the random variable W is said to have the gamma distribution.

2. The gamma function is defined by

$$\Gamma(t) = \int_{0}^{\infty} y^{t-1}e^{-y}\,dy, \quad 0 < t.$$

This integral is positive for 0 < t, because the integrand is positive. Values of it are often given in a
table of integrals. If t > 1, integration of the gamma function of t by parts yields

$$\Gamma(t) = \left[-y^{t-1}e^{-y}\right]_{0}^{\infty} + \int_{0}^{\infty}(t-1)y^{t-2}e^{-y}\,dy = (t-1)\int_{0}^{\infty}y^{t-2}e^{-y}\,dy = (t-1)\Gamma(t-1).$$

Example 2.7 

For example, Γ(6) = 5Γ(5) and Γ(3) = 2Γ(2) = (2)(1)Γ(1). Whenever t = n, a positive integer,
we have, by repeated application of Γ(t) = (t - 1)Γ(t - 1), that Γ(n) = (n - 1)Γ(n - 1) =
(n - 1)(n - 2)...(2)(1)Γ(1).
However,

$$\Gamma(1) = \int_{0}^{\infty} e^{-y}\,dy = 1.$$

Thus when n is a positive integer, we have that Γ(n) = (n - 1)!; and, for this reason, the gamma
function is called the generalized factorial.

Incidentally, Γ(1) corresponds to 0!, and we have noted that Γ(1) = 1, which is consistent with earlier
discussions.
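Python's standard library exposes the gamma function as `math.gamma`, so the recursion and the factorial identity above can be spot-checked. The sketch below is illustrative only; the test point t = 4.5 is an arbitrary assumption and the helper name is hypothetical.

```python
from math import factorial, gamma, isclose

def gamma_matches_factorial(n):
    """Check Gamma(n) = (n - 1)! for a positive integer n."""
    return isclose(gamma(n), factorial(n - 1))

print(all(gamma_matches_factorial(n) for n in range(1, 10)))

# The recursion Gamma(t) = (t - 1) * Gamma(t - 1) also holds for non-integer t > 1.
t = 4.5
print(isclose(gamma(t), (t - 1) * gamma(t - 1)))
```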

2.3.1.1.1 SUMMARIZING 

The random variable X has a gamma distribution if its p.d.f. is defined by

$$f(x) = \frac{1}{\Gamma(\alpha)\theta^{\alpha}}x^{\alpha-1}e^{-x/\theta}, \quad 0 \le x < \infty. \qquad (2.5)$$

Hence, W, the waiting time until the αth change in a Poisson process, has a gamma distribution with
parameters α and θ = 1/λ.

The function f(x) actually has the properties of a p.d.f., because f(x) ≥ 0 and

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{0}^{\infty}\frac{x^{\alpha-1}e^{-x/\theta}}{\Gamma(\alpha)\theta^{\alpha}}\,dx,$$

which, by the change of variables y = x/θ, equals

$$\int_{0}^{\infty}\frac{(\theta y)^{\alpha-1}e^{-y}}{\Gamma(\alpha)\theta^{\alpha}}\,\theta\,dy = \frac{1}{\Gamma(\alpha)}\int_{0}^{\infty}y^{\alpha-1}e^{-y}\,dy = \frac{\Gamma(\alpha)}{\Gamma(\alpha)} = 1.$$

The mean and variance are: μ = αθ and σ² = αθ².




Figure 2.6: The p.d.f. and c.d.f. graphs of the Gamma Distribution. (a) The c.d.f. graph, (b) The
p.d.f. graph.



2.3.1.1.2 



Example 2.8 

Suppose that an average of 30 customers per hour arrive at a shop in accordance with a Poisson
process. That is, if a minute is our unit, then λ = 1/2. What is the probability that the shopkeeper
will wait more than 5 minutes before both of the first two customers arrive? If X denotes the
waiting time in minutes until the second customer arrives, then X has a gamma distribution with
α = 2, θ = 1/λ = 2. Hence,

$$P(X > 5) = \int_{5}^{\infty}\frac{x^{2-1}e^{-x/2}}{\Gamma(2)2^{2}}\,dx = \int_{5}^{\infty}\frac{xe^{-x/2}}{4}\,dx = \frac{1}{4}\left[(-2)xe^{-x/2} - 4e^{-x/2}\right]_{5}^{\infty} = \frac{7}{2}e^{-5/2} = 0.287.$$

We could also have used the equation with λ = 1/θ, because α is an integer:

$$P(X > x) = \sum_{k=0}^{\alpha-1}\frac{(x/\theta)^{k}e^{-x/\theta}}{k!}.$$

Thus, with x = 5, α = 2, and θ = 2, this is equal to

$$P(X > 5) = \sum_{k=0}^{2-1}\frac{(5/2)^{k}e^{-5/2}}{k!} = e^{-5/2}\left(1 + \frac{5}{2}\right) = \frac{7}{2}e^{-5/2}.$$
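Both routes used in Example 2.8 can be mirrored numerically: the finite Poisson-type sum (valid for integer α) and a direct integration of the gamma p.d.f. The sketch below uses a midpoint rule and, as an assumption, truncates the integral at x = 205, beyond which the integrand is negligible; the helper names are hypothetical.

```python
from math import exp, factorial, gamma

def gamma_pdf(x, alpha, theta):
    """Gamma p.d.f. x**(alpha-1) * exp(-x/theta) / (Gamma(alpha) * theta**alpha)."""
    return x ** (alpha - 1) * exp(-x / theta) / (gamma(alpha) * theta**alpha)

def gamma_survival(x, alpha, theta):
    """P(X > x) for integer alpha, via the finite sum over k = 0, ..., alpha - 1."""
    return sum((x / theta) ** k * exp(-x / theta) / factorial(k) for k in range(alpha))

def integrate(func, a, b, steps=200_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

by_sum = gamma_survival(5, 2, 2)
by_integral = integrate(lambda x: gamma_pdf(x, 2, 2), 5, 205)
print(round(by_sum, 3), round(by_integral, 3))  # both 0.287
```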



2.3.1.2 Chi-Square Distribution 

Let us now consider a special case of the gamma distribution that plays an important role in statistics.

Definition 2.6:

Let X have a gamma distribution with θ = 2 and α = r/2, where r is a positive integer. If the
p.d.f. of X is

$$f(x) = \frac{1}{\Gamma(r/2)2^{r/2}}x^{r/2-1}e^{-x/2}, \quad 0 \le x < \infty, \qquad (2.6)$$

we say that X has a chi-square distribution with r degrees of freedom, which we abbreviate by
saying X is χ²(r).
The mean and the variance of this chi-square distribution are

$$\mu = \alpha\theta = \left(\frac{r}{2}\right)2 = r$$

and

$$\sigma^{2} = \alpha\theta^{2} = \left(\frac{r}{2}\right)2^{2} = 2r.$$

That is, the mean equals the number of degrees of freedom and the variance equals twice the number of
degrees of freedom.

In Figure 2 (Figure 2.7) the graphs of the chi-square p.d.f. for r = 2, 3, 5, and 8 are given.

Figure 2.7: The p.d.f. of the chi-square distribution for degrees of freedom r = 2, 3, 5, 8.



note: Observe the relationship between the mean μ = r and the point at which the p.d.f. attains its
maximum.

Because the chi-square distribution is so important in applications, tables have been prepared giving the
values of the distribution function for selected values of r and x,

$$F(x) = \int_{0}^{x}\frac{1}{\Gamma(r/2)2^{r/2}}w^{r/2-1}e^{-w/2}\,dw. \qquad (2.7)$$



Example 2.9 

Let X have a chi-square distribution with r = 5 degrees of freedom. Then, using tabularized values,

P(1.145 ≤ X ≤ 12.83) = F(12.83) - F(1.145) = 0.975 - 0.050 = 0.925

and

P(X > 15.09) = 1 - F(15.09) = 1 - 0.99 = 0.01.
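When no table is at hand, the distribution function (2.7) can be approximated by numerical integration. The sketch below uses a midpoint rule on the chi-square p.d.f.; the step count is an assumption, and the results should agree with the tabulated values of Example 2.9 to about three decimals. The function names are hypothetical.

```python
from math import exp, gamma

def chi_square_pdf(x, r):
    """Chi-square p.d.f. with r degrees of freedom, for x > 0."""
    return x ** (r / 2 - 1) * exp(-x / 2) / (gamma(r / 2) * 2 ** (r / 2))

def chi_square_cdf(x, r, steps=50_000):
    """Midpoint-rule approximation of F(x), the integral of the p.d.f. over [0, x]."""
    h = x / steps
    return h * sum(chi_square_pdf((i + 0.5) * h, r) for i in range(steps))

r = 5
print(chi_square_cdf(12.83, r) - chi_square_cdf(1.145, r))  # compare with 0.925
print(1 - chi_square_cdf(15.09, r))                          # compare with 0.01
```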



Example 2.10 

If X is χ²(7), two constants, a and b, such that P(a < X < b) = 0.95, are a = 1.690 and b = 16.01.
Other constants a and b can be found; the ones above are restricted only by the limited
table.

Probabilities like that in Example 4 (Example 2.10) are so important in statistical applications that one uses
special symbols for a and b. Let α be a positive probability (that is usually less than 0.5) and let X have a
chi-square distribution with r degrees of freedom. Then $\chi^{2}_{\alpha}(r)$ is a number such that
$P[X \ge \chi^{2}_{\alpha}(r)] = \alpha$.

That is, $\chi^{2}_{\alpha}(r)$ is the 100(1 - α) percentile (or upper 100α percent point) of the chi-square distribution with
r degrees of freedom. Then the 100α percentile is the number $\chi^{2}_{1-\alpha}(r)$ such that
$P[X \le \chi^{2}_{1-\alpha}(r)] = \alpha$.
That is, the probability to the right of $\chi^{2}_{1-\alpha}(r)$ is 1 - α. See Figure 3 (Figure 2.8).

Example 2.11

Let X have a chi-square distribution with seven degrees of freedom. Then, using tabularized values,
$\chi^{2}_{0.05}(7) = 14.07$ and $\chi^{2}_{0.95}(7) = 2.167$. These are the points that are indicated on Figure 3.

Figure 2.8: $\chi^{2}_{0.05}(7) = 14.07$ and $\chi^{2}_{0.95}(7) = 2.167$.

2.3.2

2.4 NORMAL DISTRIBUTION 4 

2.4.1 NORMAL DISTRIBUTION 

The normal distribution is perhaps the most important distribution in statistical applications since many
measurements have (approximate) normal distributions. One explanation of this fact is the role of the normal
distribution in the Central Limit Theorem.

Definition 2.7: 

1. The random variable X has a normal distribution if its p.d.f. is defined by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right], \quad -\infty < x < \infty, \qquad (2.8)$$

where μ and σ² are parameters satisfying -∞ < μ < ∞, 0 < σ < ∞, and also where exp[v] means
$e^{v}$.
2. Briefly, we say that X is N(μ, σ²).

2.4.1.1 Proof of the p.d.f. properties 

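Clearly, f(x) > 0. Let us now evaluate the integral

$$I = \int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right]dx,$$

showing that it is equal to 1. In the integral, change the variables of integration by letting z = (x - μ)/σ.
Then

$$I = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-z^{2}/2}\,dz.$$

Since I > 0, if I² = 1, then I = 1.

Now

$$I^{2} = \frac{1}{2\pi}\left[\int_{-\infty}^{\infty}e^{-x^{2}/2}\,dx\right]\left[\int_{-\infty}^{\infty}e^{-y^{2}/2}\,dy\right],$$

or equivalently,

$$I^{2} = \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp\left(-\frac{x^{2}+y^{2}}{2}\right)dx\,dy.$$

Letting x = r cos θ, y = r sin θ (i.e., using polar coordinates), we have

$$I^{2} = \frac{1}{2\pi}\int_{0}^{2\pi}\int_{0}^{\infty}e^{-r^{2}/2}\,r\,dr\,d\theta = \frac{1}{2\pi}\int_{0}^{2\pi}d\theta = \frac{1}{2\pi}\,2\pi = 1.$$

4 This content is available online at <http://cnx.Org/content/ml3130/l.4/>.

2.4.1.2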

The mean and the variance of the normal distribution are as follows:

$$E(X) = \mu$$

and

$$\mathrm{Var}(X) = \mu^{2} + \sigma^{2} - \mu^{2} = \sigma^{2}.$$

That is, the parameters μ and σ² in the p.d.f. are the mean and the variance of X.



Figure 2.9: p.d.f. and c.d.f. graphs of the Normal Distribution. (a) Probability Density Function, (b)
Cumulative Distribution Function.



Example 2.12 

If the p.d.f. of X is

$$f(x) = \frac{1}{\sqrt{32\pi}}\exp\left[-\frac{(x+7)^{2}}{32}\right], \quad -\infty < x < \infty,$$

then X is N(-7, 16).

That is, X has a normal distribution with mean μ = -7, variance σ² = 16, and the moment
generating function

$$M(t) = \exp\left(-7t + 8t^{2}\right).$$
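The claims of Example 2.12 can be checked numerically. The sketch below codes the normal p.d.f. from (2.8), integrates it with a midpoint rule over μ ± 10σ (an assumption standing in for the whole real line), and compares one value with `statistics.NormalDist` from the Python standard library. The helper names are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal p.d.f. with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def integrate(func, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

mu, sigma = -7.0, 4.0  # N(-7, 16): variance 16 means sigma = 4
lo, hi = mu - 10 * sigma, mu + 10 * sigma

area = integrate(lambda x: normal_pdf(x, mu, sigma), lo, hi)
mean = integrate(lambda x: x * normal_pdf(x, mu, sigma), lo, hi)
var = integrate(lambda x: (x - mean) ** 2 * normal_pdf(x, mu, sigma), lo, hi)

print(area, mean, var)
print(normal_pdf(0.0, mu, sigma), NormalDist(mu, sigma).pdf(0.0))
```

The three integrals should come out close to 1, -7, and 16, and the two p.d.f. values on the last line should coincide.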



2.5 THE t DISTRIBUTION 5 



2.5.1 THE t DISTRIBUTION 



In probability and statistics, the t-distribution or Student's distribution arises in the problem of
estimating the mean of a normally distributed population when the sample size is small, as well as when (as in
nearly all practical statistical work) the population standard deviation is unknown and has to be estimated
from the data.

5 This content is available online at <http://cnx.Org/content/ml3495/l.3/>.

Textbook problems treating the standard deviation as if it were known are of two kinds: 

1. those in which the sample size is so large that one may treat a data-based estimate of the variance as 
if it were certain, 

2. those that illustrate mathematical reasoning, in which the problem of estimating the standard deviation 
is temporarily ignored because that is not the point that the author or instructor is then explaining. 

2.5.1.1 THE t DISTRIBUTION 

Definition 2.8: t Distribution 

If Z is a random variable that is N(0, 1), if U is a random variable that is χ²(r), and if Z and U
are independent, then

$$T = \frac{Z}{\sqrt{U/r}} = \frac{\bar{x} - \mu}{s/\sqrt{N}} \qquad (2.9)$$

has a t distribution with r degrees of freedom.

Here μ is the population mean, $\bar{x}$ is the sample mean, and s is the estimator for the population standard
deviation (i.e., the square root of the sample variance) defined by

$$s^{2} = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_{i} - \bar{x}\right)^{2}. \qquad (2.10)$$

2.5.1.1.1 

If σ = s, t = z, the distribution becomes the normal distribution. As N increases, Student's t distribution
approaches the normal distribution (Section 2.4.1: NORMAL DISTRIBUTION). It can be derived by
transforming Student's z-distribution using

$$z = \frac{\bar{x} - \mu}{s}$$

and then defining

$$t = z\sqrt{n-1}.$$

The resulting probability and cumulative distribution functions are:

$$f(t) = \frac{\Gamma[(r+1)/2]}{\sqrt{\pi r}\,\Gamma(r/2)\left(1 + t^{2}/r\right)^{(r+1)/2}}, \qquad (2.11)$$

$$F(t) = \frac{1}{2} + \frac{1}{2}\left[I\left(1; \tfrac{1}{2}r, \tfrac{1}{2}\right) - I\left(\frac{r}{r+t^{2}}; \tfrac{1}{2}r, \tfrac{1}{2}\right)\right]\mathrm{sgn}(t) = \frac{1}{2} - \frac{i\,t\,B\left(-\frac{t^{2}}{r}; \tfrac{1}{2}, \tfrac{1}{2}(1-r)\right)\Gamma\left(\tfrac{1}{2}(r+1)\right)}{2\sqrt{\pi}\,|t|\,\Gamma\left(\tfrac{1}{2}r\right)}, \qquad (2.12)$$

where,

• r = n - 1 is the number of degrees of freedom,

• -∞ < t < ∞,

• Γ(z) is the gamma function,

• B(a, b) is the beta function,

• I(z; a, b) is the regularized beta function defined by

$$I(z; a, b) = \frac{B(z; a, b)}{B(a, b)}.$$

2.5.1.1.2

The effect of the degrees of freedom on the t distribution is illustrated by the t distributions shown in Figure
1 (Figure 2.10).

Figure 2.10: p.d.f. of the t distribution for degrees of freedom r = 3, r = 6, r = ∞.

In general, it is difficult to evaluate the distribution function of T. Some values are usually given in
tables. Also observe that the graph of the p.d.f. of T is symmetrical with respect to the vertical axis t = 0
and is very similar to the graph of the p.d.f. of the standard normal distribution N(0, 1). However, the tails
of the t distribution are heavier than those of a normal one; that is, there is more extreme probability in the
t distribution than in the standardized normal one. Because of the symmetry of the t distribution about
t = 0, the mean (if it exists) must be equal to zero. That is, it can be shown that E(T) = 0 when r ≥ 2. When
r = 1 the t distribution is the Cauchy distribution, and thus neither the mean nor the variance exists.
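The shape properties just described (symmetry, heavier tails, the Cauchy case, and the approach to N(0, 1)) can be observed by evaluating the t p.d.f. directly. The sketch below works on the log scale with `math.lgamma` so that large degrees of freedom do not overflow; the evaluation points are arbitrary assumptions and the function names are hypothetical.

```python
from math import exp, lgamma, log, log1p, pi, sqrt

def t_pdf(t, r):
    """p.d.f. of the t distribution with r degrees of freedom."""
    log_const = lgamma((r + 1) / 2) - lgamma(r / 2) - 0.5 * log(pi * r)
    return exp(log_const - (r + 1) / 2 * log1p(t * t / r))

def standard_normal_pdf(z):
    """p.d.f. of N(0, 1)."""
    return exp(-z * z / 2) / sqrt(2 * pi)

print(t_pdf(0.0, 1), 1 / pi)                    # r = 1 is the Cauchy density at 0
print(t_pdf(3.0, 3), standard_normal_pdf(3.0))  # the t tail is heavier
print(t_pdf(1.0, 10**6), standard_normal_pdf(1.0))  # large r is close to normal
```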



Chapter 3 

Estimation 

3.1 Estimation 1 

3.1.1 ESTIMATION 

Once a model is specified with its parameters and data have been collected, one is in a position to evaluate
the model's goodness of fit, that is, how well the model fits the observed pattern of data. Goodness of fit is
assessed by finding the parameter values of a model that best fit the data, a procedure called parameter
estimation.

There are two generally accepted methods of parameter estimation. They are least squares estimation
(LSE) and maximum likelihood estimation (MLE). The former is well known, since linear regression, the
sum of squares error, and the root mean squared deviation are all tied to the method. On the other hand,
MLE is not widely recognized among modelers in psychology, though it is, by far, the most commonly used
method of parameter estimation in the statistics community. LSE might be useful for obtaining a descriptive
measure for the purpose of summarizing observed data, but MLE is more suitable for statistical inference
such as model comparison. LSE has no basis for constructing confidence intervals or testing hypotheses,
whereas both are naturally built into MLE.

3.1.1.1 Properties of Estimators 

UNBIASED AND BIASED ESTIMATORS 

Let us consider random variables for which the functional form of the p.d.f. is known, but the distribution
depends on an unknown parameter θ that may have any value in a set Θ, which is called the parameter
space. In estimation, a random sample from the distribution is taken to elicit some information about
the unknown parameter θ. The experiment is repeated n independent times, the sample $X_1, X_2, \ldots, X_n$ is
observed, and one tries to guess the value of θ using the observations $x_1, x_2, \ldots, x_n$.

The function of $X_1, X_2, \ldots, X_n$ used to guess θ is called an estimator of θ. We want it to be such
that the computed estimate $u(x_1, x_2, \ldots, x_n)$ is usually close to θ. Let $Y = u(X_1, X_2, \ldots, X_n)$ be an estimator of
θ. For Y to be a good estimator of θ, a very desirable property is that its mean be equal to θ, namely
E(Y) = θ.

Definition 3.1: 

If $E[u(X_1, X_2, \ldots, X_n)] = \theta$, the statistic $u(X_1, X_2, \ldots, X_n)$ is called an unbiased estimator of θ.
Otherwise, it is said to be biased.

It is required not only that an estimator have expectation equal to θ, but also that the variance of the estimator
be as small as possible. If there are two unbiased estimators of θ, one would probably
choose the one with the smaller variance. In general, with a random sample $X_1, X_2, \ldots, X_n$ of a fixed sample
size n, a statistician might like to find the estimator $Y = u(X_1, X_2, \ldots, X_n)$ of an unknown parameter θ
which minimizes the mean (expected) value of the square of the error (difference) Y - θ, that is, minimizes

$$E[(Y - \theta)^{2}] = E\{[u(X_1, X_2, \ldots, X_n) - \theta]^{2}\}.$$

The statistic Y that minimizes $E[(Y - \theta)^{2}]$ is the one with minimum mean square error. If we restrict
our attention to unbiased estimators only, then

$$\mathrm{Var}(Y) = E[(Y - \theta)^{2}],$$

and the unbiased statistic Y that minimizes this expression is said to be the unbiased minimum variance
estimator of θ.

1 This content is available online at <http://cnx.Org/content/ml3524/l.2/>.

3.1.1.2 Method of Moments 

One of the oldest procedures for estimating parameters is the method of moments. Another method for 
finding an estimator of an unknown parameter is called the method of maximum likelihood. In general, 
in the method of moments, if there are k parameters that have to be estimated, the first k sample moments 
are set equal to the first k population moments that are given in terms of the unknown parameters. 

Example 3.1 

Let the distribution of X be $N(\mu, \sigma^2)$. Then $E(X) = \mu$ and $E(X^2) = \sigma^2 + \mu^2$. Given a random 
sample of size n, the first two sample moments are given by 

$$m_1 = \frac{1}{n}\sum_{i=1}^{n} x_i$$ 

and 

$$m_2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2.$$ 

We set $m_1 = E(X)$ and $m_2 = E(X^2)$ and solve for $\mu$ and $\sigma^2$: 

$$\frac{1}{n}\sum_{i=1}^{n} x_i = \mu$$ 

and 

$$\frac{1}{n}\sum_{i=1}^{n} x_i^2 = \sigma^2 + \mu^2.$$ 

The first equation yields $\bar{x}$ as the estimate of $\mu$. Replacing $\mu^2$ with $\bar{x}^2$ in the second equation 
and solving for $\sigma^2$, we obtain 

$$\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = v$$ 

for the solution of $\sigma^2$. 

Thus the method of moments estimators for $\mu$ and $\sigma^2$ are $\tilde{\mu} = \bar{X}$ and $\tilde{\sigma}^2 = V$. Of course, $\tilde{\mu} = \bar{X}$ is unbiased 
whereas $\tilde{\sigma}^2 = V$ is biased. 
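
The two moment equations of Example 3.1 can be tried out numerically. The following is only a minimal sketch: the function names, the seed, and the simulation settings ($\mu = 10$, $\sigma = 2$, n = 5) are illustrative assumptions, not part of the original text. It computes the method of moments estimates and then averages V over many simulated samples to illustrate that V tends to underestimate $\sigma^2$ by the factor (n - 1)/n, while the rescaled $S^2 = nV/(n - 1)$ does not.

```python
import random


def moment_estimates(sample):
    """Method of moments estimates (mu, sigma^2) for a normal sample."""
    n = len(sample)
    m1 = sum(sample) / n                  # first sample moment
    m2 = sum(x * x for x in sample) / n   # second sample moment
    return m1, m2 - m1 ** 2               # mu = m1, sigma^2 = m2 - m1^2


def average_variance_estimates(mu, sigma, n, reps, seed=0):
    """Average V (divisor n) and S^2 (divisor n - 1) over simulated samples."""
    rng = random.Random(seed)             # fixed seed: an arbitrary choice
    total_v = 0.0
    total_s2 = 0.0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        _, v = moment_estimates(sample)
        total_v += v
        total_s2 += v * n / (n - 1)       # S^2 = n V / (n - 1)
    return total_v / reps, total_s2 / reps


# Hypothetical settings: sigma^2 = 4, so E(V) = (4/5) * 4 = 3.2 and E(S^2) = 4.
mean_v, mean_s2 = average_variance_estimates(mu=10.0, sigma=2.0, n=5, reps=20000)
print(mean_v, mean_s2)
```

With these settings the average of V should land near 3.2 rather than 4, which is the bias referred to above.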
3.1.1.3 

At this stage the question arises of which of two different estimators, $\hat{\theta}$ and $\tilde{\theta}$, of a parameter $\theta$ one should 
use. Most statisticians select the one that has the smaller mean square error; for example, if 

$$E[(\hat{\theta} - \theta)^2] < E[(\tilde{\theta} - \theta)^2],$$ 

then $\hat{\theta}$ seems to be preferred. This means that if $E(\hat{\theta}) = E(\tilde{\theta}) = \theta$, then one would select the one with 
the smaller variance. 

Next, other questions should be considered. Namely, given an estimate for a parameter, how accurate is 
the estimate? How confident is one about the closeness of the estimate to the unknown parameter? 

note: CONFIDENCE INTERVALS I (Section 3.2.1: CONFIDENCE INTERVALS I) and 
CONFIDENCE INTERVALS II (Section 3.3.1: CONFIDENCE INTERVALS II) 



3.2 CONFIDENCE INTERVALS I 2 
3.2.1 CONFIDENCE INTERVALS I 

Definition 3.2: 

Given a random sample $X_1, X_2, \ldots, X_n$ from a normal distribution $N(\mu, \sigma^2)$, consider the closeness 
of $\bar{X}$, the unbiased estimator of $\mu$, to the unknown $\mu$. To do this, the error structure (distribution) 
of $\bar{X}$, namely that $\bar{X}$ is $N(\mu, \sigma^2/n)$, is used in order to construct what is called a confidence 
interval for the unknown parameter $\mu$, when the variance $\sigma^2$ is known. 

3.2.1.1 

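For the probability $1 - \alpha$, it is possible to find a number $z_{\alpha/2}$ such that 

$$P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha.$$ 

For example, if $1 - \alpha = 0.95$, then $z_{\alpha/2} = z_{0.025} = 1.96$, and if $1 - \alpha = 0.90$, then $z_{\alpha/2} = z_{0.05} = 1.645$. 
Recalling that $\sigma > 0$, the following inequalities are equivalent: 

$$-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2},$$ 

$$-z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) \le \bar{X} - \mu \le z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right),$$ 

$$-\bar{X} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) \le -\mu \le -\bar{X} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right),$$ 

$$\bar{X} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) \ge \mu \ge \bar{X} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right).$$ 

Thus, since the probability of the first of these is $1 - \alpha$, the probability of the last must also be $1 - \alpha$, 
because the latter is true if and only if the former is true. That is, 

$$P\left[\bar{X} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) \le \mu \le \bar{X} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)\right] = 1 - \alpha.$$ 

So the probability that the random interval 

$$\left[\bar{X} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right),\ \bar{X} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)\right]$$ 

includes the unknown mean $\mu$ is $1 - \alpha$. 

2 This content is available online at <http://cnx.Org/content/ml3494/l.3/>. 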
Definition 3.3: 

1. Once the sample is observed and the sample mean computed equal to $\bar{x}$, the interval 

$$\left[\bar{x} - z_{\alpha/2}(\sigma/\sqrt{n}),\ \bar{x} + z_{\alpha/2}(\sigma/\sqrt{n})\right]$$ 

is a known interval. Since the probability that the random interval covers $\mu$ before the sample is 
drawn is equal to $1 - \alpha$, call the computed interval, $\bar{x} \pm z_{\alpha/2}(\sigma/\sqrt{n})$ (for brevity), a $100(1 - \alpha)\%$ 
confidence interval for the unknown mean $\mu$. 

2. The number $100(1 - \alpha)\%$, or equivalently, $1 - \alpha$, is called the confidence coefficient. 



3.2.1.2 

For illustration, 

$$\bar{x} \pm 1.96(\sigma/\sqrt{n})$$ 

is a 95% confidence interval for $\mu$. 

It can be seen that the confidence interval for $\mu$ is centered at the point estimate $\bar{x}$ and is completed by 
subtracting and adding the quantity $z_{\alpha/2}(\sigma/\sqrt{n})$. 

note: as n increases, $z_{\alpha/2}(\sigma/\sqrt{n})$ decreases, resulting in a shorter confidence interval with the 
same confidence coefficient $1 - \alpha$. 

A shorter confidence interval indicates that there is more reliance in $\bar{x}$ as an estimate of $\mu$. For a fixed sample 
size n, the length of the confidence interval can also be shortened by decreasing the confidence coefficient 
$1 - \alpha$. But if this is done, the shorter interval is achieved by losing some confidence. 

Example 3.2 

Let $\bar{x}$ be the observed sample mean of 16 items of a random sample from the normal distribution 
$N(\mu, \sigma^2)$ with $\sigma^2 = 23.04$. A 90% confidence interval for the unknown mean $\mu$ is 

$$\left[\bar{x} - 1.645\sqrt{\frac{23.04}{16}},\ \bar{x} + 1.645\sqrt{\frac{23.04}{16}}\right].$$ 

For a particular sample this interval either does or does not contain the mean $\mu$. However, if many 
such intervals were calculated, it should be true that about 90% of them contain the mean $\mu$. 
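
The long-run reading of the confidence coefficient in Example 3.2 can be illustrated by simulation. This is a sketch under stated assumptions: the function names, the true mean of 50, the seed, and the number of repetitions are hypothetical choices, and the quantile $z_{\alpha/2}$ is taken from Python's `statistics.NormalDist` rather than from a table.

```python
import math
import random
from statistics import NormalDist


def z_interval(xbar, sigma, n, conf=0.95):
    """Interval xbar +/- z_{alpha/2} * sigma / sqrt(n) for known sigma."""
    alpha = 1.0 - conf
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = z * sigma / math.sqrt(n)
    return xbar - half, xbar + half


def coverage(mu, sigma, n, conf, reps, seed=1):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        lo, hi = z_interval(xbar, sigma, n, conf)
        hits += lo <= mu <= hi
    return hits / reps


sigma = math.sqrt(23.04)                  # sigma^2 = 23.04 as in Example 3.2
lo, hi = z_interval(xbar=0.0, sigma=sigma, n=16, conf=0.90)
half_width = (hi - lo) / 2                # about 1.645 * sqrt(23.04 / 16)
rate = coverage(mu=50.0, sigma=sigma, n=16, conf=0.90, reps=5000)
print(half_width, rate)
```

The observed coverage rate should be close to 0.90, although any single interval either contains $\mu$ or does not.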

If one cannot assume that the distribution from which the sample arose is normal, one can still obtain 
an approximate confidence interval for $\mu$. By the Central Limit Theorem the ratio $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ 
has, provided that n is large enough, the approximate normal distribution N(0, 1) when the underlying 
distribution is not normal. In this case 

$$P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) \approx 1 - \alpha,$$ 

and 

$$\left[\bar{x} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right),\ \bar{x} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)\right]$$ 

is an approximate $100(1 - \alpha)\%$ confidence interval for $\mu$. The closeness of the approximate probability 
$1 - \alpha$ to the exact probability depends on both the underlying distribution and the sample size. When the 
underlying distribution is unimodal (has only one mode) and continuous, the approximation is usually quite 
good for even small n, such as n = 5. As the underlying distribution becomes less normal (i.e., badly skewed 
or discrete), a larger sample size might be required to keep a reasonably accurate approximation. But, in almost all 
cases, an n of at least 30 is usually quite adequate. 

note: Confidence Intervals II 



3.3 CONFIDENCE INTERVALS II 3 
3.3.1 CONFIDENCE INTERVALS II 

3.3.1.1 Confidence Intervals for Means 

In the preceding considerations (Confidence Intervals I (Section 3.2.1: CONFIDENCE INTERVALS I)), 
the confidence interval for the mean $\mu$ of a normal distribution was found, assuming that the value of the 
standard deviation $\sigma$ is known. However, in most applications, the value of the standard deviation $\sigma$ is 
unknown, although in some cases one might have a very good idea about its value. 

Suppose that the underlying distribution is normal and that $\sigma^2$ is unknown. It can be shown that, given a 
random sample $X_1, X_2, \ldots, X_n$ from a normal distribution, the statistic 

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$ 

has a t distribution with r = n - 1 degrees of freedom, where $S^2$ is the usual unbiased estimator of $\sigma^2$ 
(see t distribution (Section 2.5.1: THE t DISTRIBUTION)). 
Select $t_{\alpha/2}(n - 1)$ so that 

$$P[T \ge t_{\alpha/2}(n - 1)] = \alpha/2.$$ 

Then 

$$1 - \alpha = P\left[-t_{\alpha/2}(n - 1) \le \frac{\bar{X} - \mu}{S/\sqrt{n}} \le t_{\alpha/2}(n - 1)\right]$$ 

$$= P\left[-t_{\alpha/2}(n - 1)\frac{S}{\sqrt{n}} \le \bar{X} - \mu \le t_{\alpha/2}(n - 1)\frac{S}{\sqrt{n}}\right]$$ 

$$= P\left[-\bar{X} - t_{\alpha/2}(n - 1)\frac{S}{\sqrt{n}} \le -\mu \le -\bar{X} + t_{\alpha/2}(n - 1)\frac{S}{\sqrt{n}}\right]$$ 

$$= P\left[\bar{X} - t_{\alpha/2}(n - 1)\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2}(n - 1)\frac{S}{\sqrt{n}}\right].$$ 

Thus the observations of a random sample provide $\bar{x}$ and $s^2$, and 

$$\left[\bar{x} - t_{\alpha/2}(n - 1)\frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2}(n - 1)\frac{s}{\sqrt{n}}\right]$$ 

is a $100(1 - \alpha)\%$ confidence interval for $\mu$. 

Example 3.3 

Let X equal the amount of butterfat in pounds produced by a typical cow during a 305-day milk 
production period between her first and second calves. Assume the distribution of X is $N(\mu, \sigma^2)$. 
To estimate $\mu$ a farmer measures the butterfat production for n = 20 cows, yielding the following data: 

481   537   513   583   453 
510   570   500   457   555 
618   327   350   643   499 
421   505   637   599   392 

Table 3.1 

For these data, $\bar{x} = 507.50$ and s = 89.75. Thus a point estimate of $\mu$ is $\bar{x} = 507.50$. Since 
$t_{0.05}(19) = 1.729$, a 90% confidence interval for $\mu$ is 

$$507.50 \pm 1.729\left(\frac{89.75}{\sqrt{20}}\right),$$ 

or equivalently, [472.80, 542.20]. 

3 This content is available online at <http://cnx.Org/content/ml3496/l.4/>. 
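
The arithmetic of Example 3.3 can be reproduced with a few lines of code. This is a minimal sketch, not a general-purpose routine: the function name is hypothetical, the twenty observations are entered so that they agree with the stated summary values $\bar{x} = 507.50$ and s = 89.75, and the quantile $t_{0.05}(19) = 1.729$ is copied from the example because the Python standard library has no t quantile function.

```python
import math
import statistics

# Butterfat data (pounds) for n = 20 cows, consistent with xbar = 507.50, s = 89.75.
butterfat = [481, 537, 513, 583, 453, 510, 570, 500, 457, 555,
             618, 327, 350, 643, 499, 421, 505, 637, 599, 392]

T_05_19 = 1.729  # t_{0.05}(19), table value quoted in the example


def t_interval(sample, t_quantile):
    """Return (xbar, s, (lower, upper)) for xbar +/- t * s / sqrt(n)."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)          # divisor n - 1, the usual S
    half = t_quantile * s / math.sqrt(n)
    return xbar, s, (xbar - half, xbar + half)


xbar, s, (lower, upper) = t_interval(butterfat, T_05_19)
print(xbar, round(s, 2), round(lower, 2), round(upper, 2))
```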



3.3.1.2 

Let T have a t distribution with n - 1 degrees of freedom. Then $t_{\alpha/2}(n - 1) > z_{\alpha/2}$. Consequently, the 
interval $\bar{x} \pm z_{\alpha/2}\sigma/\sqrt{n}$ is expected to be shorter than the interval $\bar{x} \pm t_{\alpha/2}(n - 1)s/\sqrt{n}$. After all, 
more information, namely the value of $\sigma$, is used in constructing the first interval. However, the length of the 
second interval is very much dependent on the value of s. If the observed s is smaller than $\sigma$, a shorter 
confidence interval could result from the second scheme. But on the average, $\bar{x} \pm z_{\alpha/2}\sigma/\sqrt{n}$ is the shorter of 
the two confidence intervals. 

If it is not possible to assume that the underlying distribution is normal but $\mu$ and $\sigma$ are both unknown, 
approximate confidence intervals for $\mu$ can still be constructed using 

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}},$$ 

which now only has an approximate t distribution. 

Generally, this approximation is quite good for many nonnormal distributions, in particular if the underlying 
distribution is symmetric, unimodal, and of the continuous type. However, if the distribution is highly 
skewed, there is a great danger in using this approximation. In such a situation, it would be safer to use 
certain nonparametric methods for finding a confidence interval for the median of the distribution. 



3.3.1.3 Confidence Interval for Variances 

The confidence interval for the variance $\sigma^2$ is based on the sample variance 

$$S^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (X_i - \bar{X})^2.$$ 

In order to find a confidence interval for $\sigma^2$, one uses the fact that the distribution of $(n - 1)S^2/\sigma^2$ is $\chi^2(n - 1)$. 
The constants a and b should be selected from the tabulated Chi-Square Distribution (Section 2.3.1.2: Chi- 
Square Distribution) with n - 1 degrees of freedom such that 

$$P\left(a \le \frac{(n - 1)S^2}{\sigma^2} \le b\right) = 1 - \alpha.$$ 

That is, select a and b so that the probabilities in the two tails are equal: 

$$a = \chi^2_{1-\alpha/2}(n - 1)$$ 

and 

$$b = \chi^2_{\alpha/2}(n - 1).$$ 

Then, solving the inequalities, we have 

$$1 - \alpha = P\left(\frac{a}{(n - 1)S^2} \le \frac{1}{\sigma^2} \le \frac{b}{(n - 1)S^2}\right) = P\left(\frac{(n - 1)S^2}{b} \le \sigma^2 \le \frac{(n - 1)S^2}{a}\right).$$ 

Thus the probability that the random interval 

$$\left[(n - 1)S^2/b,\ (n - 1)S^2/a\right]$$ 

contains the unknown $\sigma^2$ is $1 - \alpha$. Once the values of $X_1, X_2, \ldots, X_n$ are observed to be $x_1, x_2, \ldots, x_n$ and $s^2$ 
computed, then the interval 

$$\left[(n - 1)s^2/b,\ (n - 1)s^2/a\right]$$ 

is a $100(1 - \alpha)\%$ confidence interval for $\sigma^2$. 
It follows that 

$$\left[\sqrt{(n - 1)/b}\ s,\ \sqrt{(n - 1)/a}\ s\right]$$ 

is a $100(1 - \alpha)\%$ confidence interval for $\sigma$, the standard deviation. 

Example 3.4 

Assume that the time in days required for maturation of seeds of a species of a flowering plant 
found in Mexico is $N(\mu, \sigma^2)$. A random sample of n = 13 seeds, both parents having narrow leaves, 
yielded $\bar{x} = 18.97$ days and $12s^2 = \sum_{i=1}^{13} (x_i - \bar{x})^2 = 128.41$. 

A 90% confidence interval for $\sigma^2$ is $[128.41/21.03,\ 128.41/5.226] = [6.11, 24.57]$, because $5.226 = \chi^2_{0.95}(12)$ and 
$21.03 = \chi^2_{0.05}(12)$, which can be read from the tabulated Chi-Square Distribution. The corresponding 
90% confidence interval for $\sigma$ is $[\sqrt{6.11}, \sqrt{24.57}] = [2.47, 4.96]$. 
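
The numbers in Example 3.4 follow directly from the interval formula for $\sigma^2$. The sketch below only repeats that arithmetic; the function name is an illustrative assumption, and the two chi-square quantiles are copied from the example rather than computed.

```python
import math


def variance_interval(sum_sq_dev, chi2_lower, chi2_upper):
    """Intervals for sigma^2 and sigma from (n - 1) s^2 and chi-square quantiles.

    chi2_lower is a = chi^2_{1 - alpha/2}(n - 1); chi2_upper is b = chi^2_{alpha/2}(n - 1).
    """
    var_lo = sum_sq_dev / chi2_upper      # (n - 1) s^2 / b
    var_hi = sum_sq_dev / chi2_lower      # (n - 1) s^2 / a
    return (var_lo, var_hi), (math.sqrt(var_lo), math.sqrt(var_hi))


# Values from Example 3.4: 12 s^2 = 128.41, a = 5.226, b = 21.03.
var_ci, sd_ci = variance_interval(128.41, chi2_lower=5.226, chi2_upper=21.03)
print([round(v, 2) for v in var_ci], [round(v, 2) for v in sd_ci])
```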



3.3.1.4 

Although a and b are generally selected so that the probabilities in the two tails are equal, the resulting 
$100(1 - \alpha)\%$ confidence interval is not the shortest that can be formed using the available data. Tables 
in the appendices give solutions for a and b that yield the confidence interval of minimum length for the standard 
deviation. 

3.4 SAMPLE SIZE 4 
3.4.1 Sample Size 

A very frequently asked question in statistical consulting is: how large should the sample size be to 
estimate a mean? 

The answer will depend on the variation associated with the random variable under observation. The 
statistician could correctly respond that only one item is needed, provided that the standard deviation of the 
distribution is zero. That is, if $\sigma$ is equal to zero, then the value of that one item would necessarily equal the 
unknown mean of the distribution. This is the extreme case and one that is not met in practice. However, 
the smaller the variance, the smaller the sample size needed to achieve a given degree of accuracy. 

Example 3.5 

A mathematics department wishes to evaluate a new method of teaching calculus that does 
mathematics using a computer. At the end of the course, the evaluation will be made on the basis of 
scores of the participating students on a standard test. Because there is an interest in estimating 
the mean score $\mu$ for students taking calculus using a computer, there is a desire to determine 
the number of students, n, who are to be selected at random from a larger group. So, let us find 
the sample size n such that we are fairly confident that $\bar{x} \pm 1$ contains the unknown test mean $\mu$. 
From past experience it is believed that the standard deviation associated with this type of test 
is 15. Accordingly, using the fact that the sample mean of the test scores, $\bar{X}$, is approximately 
$N(\mu, \sigma^2/n)$, it is seen that the interval given by $\bar{x} \pm 1.96(15/\sqrt{n})$ will serve as an approximate 95% 
confidence interval for $\mu$. 

That is, $1.96(15/\sqrt{n}) = 1$, or equivalently $\sqrt{n} = 29.4$, and thus $n \approx 864.36$, or n = 865 because n 
must be an integer. It is quite likely that it had not been anticipated that as many as 865 students 
would be needed in this study. If that is the case, the statistician must discuss with those involved 
in the experiment whether or not the accuracy and the confidence level could be relaxed somewhat. For 
illustration, rather than requiring $\bar{x} \pm 1$ to be a 95% confidence interval for $\mu$, possibly $\bar{x} \pm 2$ would 
be a satisfactory 80% one. If this modification is acceptable, we now have $1.282(15/\sqrt{n}) = 2$, or 
equivalently $\sqrt{n} = 9.615$, and thus $n \approx 92.4$. Since n must be an integer, n = 93 is used in practice. 

4 This content is available online at <http://cnx.Org/content/ml3531/l.2/>. 



3.4.1.1 

Most likely, the persons involved in this project would find this a more reasonable sample size. Of course, 
any sample size greater than 93 could be used. Then either the length of the confidence interval could be 
decreased from that of $\bar{x} \pm 2$, or the confidence coefficient could be increased from 80%, or a combination of 
both. Also, since there might be some question of whether the standard deviation $\sigma$ actually equals 15, the 
sample standard deviation s would no doubt be used in the construction of the interval. 
For example, suppose that the sample characteristics observed are 

$$n = 145,\quad \bar{x} = 77.2,\quad s = 13.2;$$ 

then $\bar{x} \pm 1.282s/\sqrt{n}$, or $77.2 \pm 1.41$, provides an approximate 80% confidence interval for $\mu$. 

In general, if we want the $100(1 - \alpha)\%$ confidence interval for $\mu$, $\bar{x} \pm z_{\alpha/2}(\sigma/\sqrt{n})$, to be no longer than 
that given by $\bar{x} \pm \varepsilon$, the sample size n is the solution of $\varepsilon = z_{\alpha/2}\sigma/\sqrt{n}$, where $\Phi(z_{\alpha/2}) = 1 - \alpha/2$. 

That is, 

$$n = \frac{z_{\alpha/2}^2\sigma^2}{\varepsilon^2},$$ 

where it is assumed that $\sigma^2$ is known. 
Sometimes 

$$\varepsilon = z_{\alpha/2}\sigma/\sqrt{n}$$ 

is called the maximum error of the estimate. If the experimenter has no idea about the value of $\sigma^2$, 
it may be necessary to first take a preliminary sample to estimate $\sigma^2$. 
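
The sample-size formula $n = z_{\alpha/2}^2\sigma^2/\varepsilon^2$ is easy to evaluate directly. The following sketch assumes a known $\sigma$; the function name is hypothetical, and the quantile is computed with `statistics.NormalDist` instead of being read from a table, so intermediate values differ slightly from the rounded 1.96 and 1.282 used in Example 3.5 while the rounded-up sample sizes agree.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, max_error, conf):
    """Smallest integer n with z_{alpha/2} * sigma / sqrt(n) <= max_error."""
    alpha = 1.0 - conf
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil((z * sigma / max_error) ** 2)


n_95 = sample_size_for_mean(sigma=15, max_error=1, conf=0.95)
n_80 = sample_size_for_mean(sigma=15, max_error=2, conf=0.80)
print(n_95, n_80)
```

Rounding up rather than to the nearest integer keeps the maximum error at or below the requested $\varepsilon$.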

3.4.1.2 

The type of statistic we see most often in newspapers and magazines is an estimate of a proportion p. We 
might, for example, want to know the percentage of the labor force that is unemployed or the percentage 
of voters favoring a certain candidate. Sometimes extremely important decisions are made on the basis of 
these estimates. If this is the case, we would most certainly desire short confidence intervals for p with large 
confidence coefficients. We recognize that these conditions will require a large sample size. On the other 
hand, if the fraction p being estimated is not too important, an estimate associated with a longer confidence 
interval with a smaller confidence coefficient is satisfactory, and thus a smaller sample size can be used. 

In general, to find the required sample size to estimate p, recall that the point estimate of p is $\hat{p} = y/n$ 
and that an approximate $100(1 - \alpha)\%$ confidence interval for p is 

$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}.$$ 

Suppose we want an estimate of p that is within $\varepsilon$ of the unknown p with $100(1 - \alpha)\%$ confidence, where 

$$\varepsilon = z_{\alpha/2}\sqrt{\hat{p}(1 - \hat{p})/n}$$ 

is the maximum error of the point estimate $\hat{p} = y/n$. Since $\hat{p}$ is unknown 
before the experiment is run, we cannot use the value of $\hat{p}$ in our determination of n. However, if it is known 
that p is about equal to $p^*$, the necessary sample size n is the solution of 

$$\varepsilon = \frac{z_{\alpha/2}\sqrt{p^*(1 - p^*)}}{\sqrt{n}}.$$ 

That is, 

$$n = \frac{z_{\alpha/2}^2\,p^*(1 - p^*)}{\varepsilon^2}.$$ 
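
As a small illustration of $n = z_{\alpha/2}^2\,p^*(1 - p^*)/\varepsilon^2$, the sketch below evaluates the formula for two planning values. The function name and the inputs ($p^* = 0.3$ with $\varepsilon = 0.05$, and $p^* = 0.5$ with $\varepsilon = 0.03$) are hypothetical and do not come from the text.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p_star, max_error, conf):
    """Smallest integer n with z_{alpha/2} * sqrt(p*(1 - p*)/n) <= max_error."""
    alpha = 1.0 - conf
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil(z ** 2 * p_star * (1.0 - p_star) / max_error ** 2)


# Hypothetical planning values, 95% confidence in both cases.
n_a = sample_size_for_proportion(p_star=0.3, max_error=0.05, conf=0.95)
n_b = sample_size_for_proportion(p_star=0.5, max_error=0.03, conf=0.95)
print(n_a, n_b)
```

Because $p^*(1 - p^*)$ is largest at $p^* = 0.5$, that planning value gives the largest required n for a given $\varepsilon$.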

3.5 Maximum Likelihood Estimation (MLE) 5 

3.5.1 MAXIMUM LIKELIHOOD ESTIMATION (MLE) 

3.5.1.1 Likelihood function 

From a statistical standpoint, the data vector $x = (x_1, x_2, \ldots, x_n)$ as the outcome of an experiment is a 
random sample from an unknown population. The goal of data analysis is to identify the population 
that is most likely to have generated the sample. In statistics, each population is identified by a 
corresponding probability distribution. Associated with each probability distribution is a unique value of 
the model's parameter. As the parameter changes in value, different probability distributions are generated. 
Formally, a model is defined as the family of probability distributions indexed by the model's parameters. 

Let us denote the probability distribution function (PDF) by $f(x|\theta)$, which specifies the probability 
of observing data x given the parameter $\theta$. The parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ is a vector defined 
on a multi-dimensional parameter space. If the individual observations $x_i$ are statistically independent of 
one another, then according to the theory of probability, the PDF for the data $x = (x_1, x_2, \ldots, x_n)$ can be 
expressed as a product of the PDFs for the individual observations, 

$$f(x|\theta) = f(x_1|\theta)\,f(x_2|\theta) \cdots f(x_n|\theta),$$ 

$$L(\theta) = \prod_{i=1}^{n} f(x_i|\theta).$$ 

To illustrate the idea of a PDF, consider the simplest case with one observation and one parameter, 
that is, n = k = 1. Suppose that the data x represents the number of successes in a sequence of 10 
independent binary trials (e.g., a coin tossing experiment) and that the probability of a success on any one 
trial, represented by the parameter $\theta$, is 0.2. The PDF in this case is then given by 

$$f(x|\theta = 0.2) = \frac{10!}{x!\,(10 - x)!}(0.2)^x (0.8)^{10 - x},\quad (x = 0, 1, \ldots, 10),$$ 

which is known as the binomial probability distribution. The shape of this PDF is shown in the top panel 
of Figure 1 (Figure 3.1). If the parameter value is changed to, say, $\theta = 0.7$, a new PDF is obtained as 

$$f(x|\theta = 0.7) = \frac{10!}{x!\,(10 - x)!}(0.7)^x (0.3)^{10 - x},\quad (x = 0, 1, \ldots, 10),$$ 

whose shape is shown in the bottom panel of Figure 1 (Figure 3.1). The following is the general 
expression of the binomial PDF for arbitrary values of $\theta$ and n: 

$$f(x|\theta) = \frac{n!}{x!\,(n - x)!}\theta^x (1 - \theta)^{n - x},\quad 0 \le \theta \le 1,\ x = 0, 1, \ldots, n;$$ 

which as a function of x specifies the probability of data x for a given value of the parameter $\theta$. The 
collection of all such PDFs generated by varying the parameter across its range (0 - 1 in this case) defines a 
model. 

5 This content is available online at <http://cnx.Org/content/ml3501/l.3/>. 

Figure 3.1: Binomial probability distributions of sample size n = 10 and probability parameter 
$\theta = 0.2$ (top) and $\theta = 0.7$ (bottom). 
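
The two binomial distributions described for Figure 3.1 can be tabulated directly from the general expression of the binomial PDF. This is a minimal sketch; the function and variable names are illustrative assumptions.

```python
import math


def binomial_pmf(x, n, theta):
    """f(x | theta) = n! / (x! (n - x)!) * theta^x * (1 - theta)^(n - x)."""
    return math.comb(n, x) * theta ** x * (1.0 - theta) ** (n - x)


n = 10
pmf_02 = [binomial_pmf(x, n, 0.2) for x in range(n + 1)]   # top panel
pmf_07 = [binomial_pmf(x, n, 0.7) for x in range(n + 1)]   # bottom panel

# Most probable number of successes under each parameter value.
mode_02 = max(range(n + 1), key=lambda x: pmf_02[x])
mode_07 = max(range(n + 1), key=lambda x: pmf_07[x])

for x in range(n + 1):
    print(x, round(pmf_02[x], 4), round(pmf_07[x], 4))
```

Each list sums to one, and changing $\theta$ from 0.2 to 0.7 moves the bulk of the probability from small to large values of x.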

3.5.1.2 Maximum Likelihood Estimation 

Once data have been collected and the likelihood function of a model given the data is determined, one is 
in a position to make statistical inferences about the population, that is, the probability distribution that 
underlies the data. Given that different parameter values index different probability distributions (Figure 1 
(Figure 3.1)), we are interested in finding the parameter value that corresponds to the desired PDF. 

The principle of maximum likelihood estimation (MLE), originally developed by R. A. Fisher in 
the 1920s, states that the desired probability distribution is the one that makes the observed data most 
likely, which is obtained by seeking the value of the parameter vector that maximizes the likelihood function 
(Section 3.5.1: MAXIMUM LIKELIHOOD ESTIMATION (MLE)) $L(\theta)$. The resulting parameter, which 
is sought by searching the multidimensional parameter space, is called the MLE estimate, denoted by 

$$\hat{\theta}_{MLE} = (\hat{\theta}_{1,MLE}, \ldots, \hat{\theta}_{k,MLE}).$$ 



3.5.1.2.1 

Let p equal the probability of success in a sequence of Bernoulli trials or the proportion of a large population 
with a certain characteristic. The method of moments estimate for p is the relative frequency of success (having 
that characteristic). It will be shown below that the maximum likelihood estimate for p is also the relative 
frequency of success. 

Suppose that X is b(1, p) so that the p.d.f. of X is 

$$f(x; p) = p^x (1 - p)^{1 - x},\quad x = 0, 1,\quad 0 \le p \le 1.$$ 

Sometimes this is written 

$$p \in \Omega = \{p : 0 \le p \le 1\},$$ 

where $\Omega$ is used to represent the parameter space, that is, the space of all possible values of the parameter. A 
random sample $X_1, X_2, \ldots, X_n$ is taken, and the problem is to find an estimator $u(X_1, X_2, \ldots, X_n)$ such that 
$u(x_1, x_2, \ldots, x_n)$ is a good point estimate of p, where $x_1, x_2, \ldots, x_n$ are the observed values of the random 
sample. Now the probability that $X_1, X_2, \ldots, X_n$ take the particular values is 

$$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} p^{x_i}(1 - p)^{1 - x_i} = p^{\sum x_i}(1 - p)^{n - \sum x_i},$$ 

which is the joint p.d.f. of $X_1, X_2, \ldots, X_n$ evaluated at the observed values. One reasonable way to proceed 
towards finding a good estimate of p is to regard this probability (or joint p.d.f.) as a function of p and 
find the value of p that maximizes it. That is, find the p value most likely to have produced these sample 
values. The joint p.d.f., when regarded as a function of p, is frequently called the likelihood function. 
Thus here the likelihood function is: 

$$L(p) = L(p; x_1, x_2, \ldots, x_n) = f(x_1; p) f(x_2; p) \cdots f(x_n; p) = p^{\sum x_i}(1 - p)^{n - \sum x_i},\quad 0 \le p \le 1.$$ 

To find the value of p that maximizes L(p), first take its derivative for 0 < p < 1: 

$$\frac{dL(p)}{dp} = \left(\sum x_i\right) p^{\sum x_i - 1}(1 - p)^{n - \sum x_i} - \left(n - \sum x_i\right) p^{\sum x_i}(1 - p)^{n - \sum x_i - 1}.$$ 

Setting this first derivative equal to zero gives 

$$p^{\sum x_i}(1 - p)^{n - \sum x_i}\left[\frac{\sum x_i}{p} - \frac{n - \sum x_i}{1 - p}\right] = 0.$$ 

Since 0 < p < 1, this equals zero when 

$$\frac{\sum x_i}{p} - \frac{n - \sum x_i}{1 - p} = 0.$$ 

Or, equivalently, 

$$p = \frac{\sum x_i}{n} = \bar{x}.$$ 

The corresponding statistic, namely $\sum X_i / n = \bar{X}$, is called the maximum likelihood estimator and 
is denoted by $\hat{p}$, that is, 

$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}.$$ 

When finding a maximum likelihood estimator, it is often easier to find the value of the parameter that 
maximizes the natural logarithm of the likelihood function rather than the value of the parameter that 
maximizes the likelihood function itself. Because the natural logarithm function is an increasing function, 
the solution will be the same. To see this, the example which was considered above gives, for 0 < p < 1, 

$$\ln L(p) = \left(\sum x_i\right)\ln p + \left(n - \sum x_i\right)\ln(1 - p).$$ 

To find the maximum, set the first derivative equal to zero to obtain 

$$\frac{d[\ln L(p)]}{dp} = \left(\sum x_i\right)\left(\frac{1}{p}\right) + \left(n - \sum x_i\right)\left(\frac{-1}{1 - p}\right) = 0,$$ 

which is the same as the previous equation. Thus the solution is $p = \bar{x}$ and the maximum likelihood estimator 
for p is $\hat{p} = \bar{X}$. 
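
The closed-form result $\hat{p} = \bar{x}$ can be checked against a brute-force search over the log-likelihood. This is a sketch under stated assumptions: the ten Bernoulli outcomes are made up for illustration, and the grid of candidate values with step 0.001 is an arbitrary choice.

```python
import math


def bernoulli_log_likelihood(p, sample):
    """ln L(p) = (sum x_i) ln p + (n - sum x_i) ln(1 - p), for 0 < p < 1."""
    successes = sum(sample)
    failures = len(sample) - successes
    return successes * math.log(p) + failures * math.log(1.0 - p)


# Hypothetical observed outcomes (1 = success, 0 = failure).
sample = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

# Candidate values 0.001, 0.002, ..., 0.999 inside the open interval (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, sample))

p_closed = sum(sample) / len(sample)      # relative frequency of success
print(p_grid, p_closed)
```

For this sample both approaches give 0.3, the relative frequency of success.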

Motivated by the preceding illustration, the formal definition of maximum likelihood estimators is 
presented. This definition is used in both the discrete and continuous cases. In many practical cases, these 
estimators (and estimates) are unique. For many applications there is just one unknown parameter. In this 
case the likelihood function is given by 

$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta).$$ 



note: Maximum Likelihood Estimation - Examples (Section 3.6.1: MAXIMUM LIKELIHOOD 
ESTIMATION - EXAMPLES) 



3.6 Maximum Likelihood Estimation - Examples 6 

3.6.1 MAXIMUM LIKELIHOOD ESTIMATION - EXAMPLES 
3.6.1.1 EXPONENTIAL DISTRIBUTION 

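Let $X_1, X_2, \ldots, X_n$ be a random sample from the exponential distribution with p.d.f. 

$$f(x; \theta) = \frac{1}{\theta}e^{-x/\theta},\quad 0 < x < \infty,\quad \theta \in \Omega = \{\theta : 0 < \theta < \infty\}.$$ 

The likelihood function is given by 

$$L(\theta) = L(\theta; x_1, x_2, \ldots, x_n) = \left(\frac{1}{\theta}e^{-x_1/\theta}\right)\left(\frac{1}{\theta}e^{-x_2/\theta}\right) \cdots \left(\frac{1}{\theta}e^{-x_n/\theta}\right) = \frac{1}{\theta^n}\exp\left(\frac{-\sum_{i=1}^{n} x_i}{\theta}\right),\quad 0 < \theta < \infty.$$ 

The natural logarithm of $L(\theta)$ is 

$$\ln L(\theta) = -n\ln\theta - \frac{1}{\theta}\sum_{i=1}^{n} x_i,\quad 0 < \theta < \infty.$$ 

Thus, 

$$\frac{d[\ln L(\theta)]}{d\theta} = \frac{-n}{\theta} + \frac{\sum_{i=1}^{n} x_i}{\theta^2} = 0.$$ 

The solution of this equation for $\theta$ is 

$$\theta = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}.$$ 

Note that 

$$\frac{d[\ln L(\theta)]}{d\theta} = \frac{1}{\theta}\left(-n + \frac{n\bar{x}}{\theta}\right) \begin{cases} > 0, & \theta < \bar{x}, \\ = 0, & \theta = \bar{x}, \\ < 0, & \theta > \bar{x}. \end{cases}$$ 

Hence, $\ln L(\theta)$ does have a maximum at $\bar{x}$, and thus the maximum likelihood estimator for $\theta$ is 

$$\hat{\theta} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$ 

This is both an unbiased estimator and the method of moments estimator for $\theta$. 

6 This content is available online at <http://cnx.Org/content/ml3500/l.3/>. 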

3.6.1.2 GEOMETRIC DISTRIBUTION 

Let $X_1, X_2, \ldots, X_n$ be a random sample from the geometric distribution with p.d.f. 

$$f(x; p) = (1 - p)^{x - 1}p,\quad x = 1, 2, 3, \ldots$$ 

The likelihood function is given by 

$$L(p) = (1 - p)^{x_1 - 1}p\,(1 - p)^{x_2 - 1}p \cdots (1 - p)^{x_n - 1}p = p^n (1 - p)^{\sum x_i - n},\quad 0 \le p \le 1.$$ 

The natural logarithm of L(p) is 

$$\ln L(p) = n\ln p + \left(\sum_{i=1}^{n} x_i - n\right)\ln(1 - p),\quad 0 < p < 1.$$ 

Thus, restricting p to 0 < p < 1 so as to be able to take the derivative, we have 

$$\frac{d\ln L(p)}{dp} = \frac{n}{p} - \frac{\sum_{i=1}^{n} x_i - n}{1 - p} = 0.$$ 

Solving for p, we obtain 

$$p = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}.$$ 

So the maximum likelihood estimator of p is 

$$\hat{p} = \frac{n}{\sum_{i=1}^{n} X_i} = \frac{1}{\bar{X}}.$$ 

Again this estimator is the method of moments estimator, and it agrees with intuition because, in n 
observations of a geometric random variable, there are n successes in the $\sum_{i=1}^{n} x_i$ trials. Thus the estimate 
of p is the number of successes divided by the total number of trials. 
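
The geometric result $\hat{p} = n/\sum x_i$ can be checked on a small example. In the sketch below the five trial counts are hypothetical, and the function names are illustrative; the log-likelihood is evaluated at $\hat{p}$ and at two nearby values to confirm that $\hat{p}$ gives the larger value.

```python
import math


def geometric_mle(sample):
    """Maximum likelihood estimate n / sum(x_i) = 1 / xbar."""
    return len(sample) / sum(sample)


def geometric_log_likelihood(p, sample):
    """ln L(p) = n ln p + (sum x_i - n) ln(1 - p), for 0 < p < 1."""
    n = len(sample)
    return n * math.log(p) + (sum(sample) - n) * math.log(1.0 - p)


# Hypothetical data: number of trials needed to reach each success.
sample = [3, 1, 2, 5, 1]

p_hat = geometric_mle(sample)             # 5 successes in 12 trials
at_hat = geometric_log_likelihood(p_hat, sample)
below = geometric_log_likelihood(p_hat - 0.01, sample)
above = geometric_log_likelihood(p_hat + 0.01, sample)
print(p_hat, at_hat, below, above)
```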

3.6.1.3 NORMAL DISTRIBUTION 

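Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\theta_1, \theta_2)$, where 

$$\Omega = \{(\theta_1, \theta_2) : -\infty < \theta_1 < \infty,\ 0 < \theta_2 < \infty\}.$$ 

That is, here let $\theta_1 = \mu$ and $\theta_2 = \sigma^2$. Then 

$$L(\theta_1, \theta_2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\theta_2}}\exp\left[-\frac{(x_i - \theta_1)^2}{2\theta_2}\right],$$ 

or equivalently, 

$$L(\theta_1, \theta_2) = \left(\frac{1}{\sqrt{2\pi\theta_2}}\right)^n \exp\left[\frac{-\sum_{i=1}^{n} (x_i - \theta_1)^2}{2\theta_2}\right],\quad (\theta_1, \theta_2) \in \Omega.$$ 

The natural logarithm of the likelihood function is 

$$\ln L(\theta_1, \theta_2) = -\frac{n}{2}\ln(2\pi\theta_2) - \frac{\sum_{i=1}^{n} (x_i - \theta_1)^2}{2\theta_2}.$$ 

The partial derivatives with respect to $\theta_1$ and $\theta_2$ are 

$$\frac{\partial(\ln L)}{\partial\theta_1} = \frac{1}{\theta_2}\sum_{i=1}^{n} (x_i - \theta_1)$$ 

and 

$$\frac{\partial(\ln L)}{\partial\theta_2} = \frac{-n}{2\theta_2} + \frac{1}{2\theta_2^2}\sum_{i=1}^{n} (x_i - \theta_1)^2.$$ 

The equation 

$$\frac{\partial(\ln L)}{\partial\theta_1} = 0$$ 

has the solution $\theta_1 = \bar{x}$. Setting 

$$\frac{\partial(\ln L)}{\partial\theta_2} = 0$$ 

and replacing $\theta_1$ by $\bar{x}$ yields 

$$\theta_2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$ 

By considering the usual condition on the second partial derivatives, these solutions do provide a 
maximum. Thus the maximum likelihood estimators of $\mu = \theta_1$ and $\sigma^2 = \theta_2$ are 

$$\hat{\theta}_1 = \bar{X}$$ 

and 

$$\hat{\theta}_2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2 = V.$$ 

When we compare the above example with the introductory one, we see that the method of moments 
estimators and the maximum likelihood estimators for $\mu$ and $\sigma^2$ are the same. But this is not always the case. 
If they are not the same, which is better? The maximum likelihood estimator of $\theta$ has 
an approximate normal distribution with mean $\theta$ and a variance that is equal to a certain lower bound; thus, 
at least approximately, it is the unbiased minimum variance estimator. Accordingly, most statisticians prefer 
the maximum likelihood estimators to estimators found using the method of moments. 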

3.6.1.4 BINOMIAL DISTRIBUTION 

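Observations: k successes in n Bernoulli trials, where $x_i$ (equal to 0 or 1) is the outcome of trial i and $k = \sum_{i=1}^{n} x_i$. 

$$f(k) = \frac{n!}{k!\,(n - k)!}p^k (1 - p)^{n - k}$$ 

$$L(p) = \frac{n!}{k!\,(n - k)!}p^{\sum x_i}(1 - p)^{n - \sum x_i}$$ 

$$\ln L(p) = \ln\left(\frac{n!}{k!\,(n - k)!}\right) + \sum_{i=1}^{n} x_i \ln p + \left(n - \sum_{i=1}^{n} x_i\right)\ln(1 - p)$$ 

$$\frac{d\ln L(p)}{dp} = \frac{1}{p}\sum_{i=1}^{n} x_i - \left(n - \sum_{i=1}^{n} x_i\right)\frac{1}{1 - p} = 0$$ 

$$\frac{(1 - \hat{p})\sum_{i=1}^{n} x_i - \left(n - \sum_{i=1}^{n} x_i\right)\hat{p}}{\hat{p}(1 - \hat{p})} = 0$$ 

$$\sum_{i=1}^{n} x_i - \hat{p}\sum_{i=1}^{n} x_i - n\hat{p} + \hat{p}\sum_{i=1}^{n} x_i = 0$$ 

$$\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{k}{n}$$ 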

3.6.1.5 POISSON DISTRIBUTION 

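Observations: $x_1, x_2, \ldots, x_n$, 

$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!},\quad x = 0, 1, 2, \ldots$$ 

$$L(\lambda) = \prod_{i=1}^{n} \left(\frac{\lambda^{x_i} e^{-\lambda}}{x_i!}\right) = e^{-\lambda n}\,\frac{\lambda^{\sum x_i}}{\prod_{i=1}^{n} x_i!}$$ 

$$\ln L(\lambda) = -\lambda n + \sum_{i=1}^{n} x_i \ln\lambda - \ln\left(\prod_{i=1}^{n} x_i!\right)$$ 

$$\frac{d\ln L}{d\lambda} = -n + \sum_{i=1}^{n} x_i \frac{1}{\lambda}$$ 

$$-n + \sum_{i=1}^{n} x_i \frac{1}{\hat{\lambda}} = 0$$ 

$$\hat{\lambda} = \frac{\sum_{i=1}^{n} x_i}{n}$$ 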

3.7 ASYMPTOTIC DISTRIBUTION OF MAXIMUM LIKELI- 
HOOD ESTIMATORS 7 

3.7.1 ASYMPTOTIC DISTRIBUTION OF MAXIMUM LIKELIHOOD ESTI- 
MATORS 

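Let us consider a distribution with p.d.f. $f(x; \theta)$ such that the parameter $\theta$ is not involved in the support of 
the distribution. We want to be able to find the maximum likelihood estimator $\hat{\theta}$ by solving 

$$\frac{\partial[\ln L(\theta)]}{\partial\theta} = 0,$$ 

where here the partial derivative was used because $L(\theta)$ involves $x_1, x_2, \ldots, x_n$. 
That is, 

$$\frac{\partial[\ln L(\hat{\theta})]}{\partial\theta} = 0,$$ 

where now, with $\hat{\theta}$ in this expression, 

$$L(\hat{\theta}) = f(X_1; \hat{\theta})\,f(X_2; \hat{\theta}) \cdots f(X_n; \hat{\theta}).$$ 

We can approximate the left-hand member of this latter equation by a linear function found from the 
first two terms of a Taylor's series expanded about $\theta$, namely 

$$\frac{\partial[\ln L(\theta)]}{\partial\theta} + (\hat{\theta} - \theta)\frac{\partial^2[\ln L(\theta)]}{\partial\theta^2} \approx 0,$$ 

when $L(\theta) = f(X_1; \theta)\,f(X_2; \theta) \cdots f(X_n; \theta)$. 

Obviously, this approximation is good enough only if $\hat{\theta}$ is close to $\theta$, and an adequate mathematical proof 
involves those conditions. But a heuristic argument can be made by solving for $\hat{\theta} - \theta$ to obtain 

$$\hat{\theta} - \theta = \frac{\partial[\ln L(\theta)]/\partial\theta}{-\partial^2[\ln L(\theta)]/\partial\theta^2}. \qquad (3.1)$$ 

Recall that 

$$\ln L(\theta) = \ln f(X_1; \theta) + \ln f(X_2; \theta) + \cdots + \ln f(X_n; \theta)$$ 

and 

$$\frac{\partial \ln L(\theta)}{\partial\theta} = \sum_{i=1}^{n} \frac{\partial[\ln f(X_i; \theta)]}{\partial\theta}. \qquad (3.2)$$ 

The expression (3.2) is the sum of the n independent and identically distributed random variables 

$$Y_i = \frac{\partial[\ln f(X_i; \theta)]}{\partial\theta},\quad i = 1, 2, \ldots, n,$$ 

and thus by the Central Limit Theorem has an approximate normal distribution with mean (in the continuous 
case) equal to 

$$\int_{-\infty}^{\infty} \frac{\partial[\ln f(x; \theta)]}{\partial\theta} f(x; \theta)\,dx = \int_{-\infty}^{\infty} \frac{\partial[f(x; \theta)]/\partial\theta}{f(x; \theta)} f(x; \theta)\,dx = \int_{-\infty}^{\infty} \frac{\partial[f(x; \theta)]}{\partial\theta}\,dx = \frac{\partial}{\partial\theta}\left[\int_{-\infty}^{\infty} f(x; \theta)\,dx\right] = \frac{\partial}{\partial\theta}[1] = 0. \qquad (3.3)$$ 

Clearly, the mathematical condition is needed that it is permissible to interchange the operations of 
integration and differentiation in those last steps. Of course, the integral of $f(x; \theta)$ is equal to one because it 
is a p.d.f. 

Since we know that the mean of each Y is 

$$\int_{-\infty}^{\infty} \frac{\partial[\ln f(x; \theta)]}{\partial\theta} f(x; \theta)\,dx = 0,$$ 

let us take derivatives of each member of this equation with respect to $\theta$, obtaining 

$$\int_{-\infty}^{\infty} \left\{\frac{\partial^2[\ln f(x; \theta)]}{\partial\theta^2} f(x; \theta) + \frac{\partial[\ln f(x; \theta)]}{\partial\theta}\,\frac{\partial[f(x; \theta)]}{\partial\theta}\right\} dx = 0.$$ 

However, 

$$\frac{\partial[f(x; \theta)]}{\partial\theta} = \frac{\partial[\ln f(x; \theta)]}{\partial\theta} f(x; \theta),$$ 

so 

$$\int_{-\infty}^{\infty} \left\{\frac{\partial[\ln f(x; \theta)]}{\partial\theta}\right\}^2 f(x; \theta)\,dx = -\int_{-\infty}^{\infty} \frac{\partial^2[\ln f(x; \theta)]}{\partial\theta^2} f(x; \theta)\,dx.$$ 

Since E(Y) = 0, this last expression provides the variance of $Y = \partial[\ln f(X; \theta)]/\partial\theta$. Then the variance 
of expression (3.2) is n times this value, namely 

$$-nE\left\{\frac{\partial^2[\ln f(X; \theta)]}{\partial\theta^2}\right\}.$$ 

Let us rewrite (3.1) as 

$$\frac{\sqrt{n}\,(\hat{\theta} - \theta)}{1\Big/\sqrt{-E\{\partial^2[\ln f(X; \theta)]/\partial\theta^2\}}} = \frac{\dfrac{\partial[\ln L(\theta)]/\partial\theta}{\sqrt{-nE\{\partial^2[\ln f(X; \theta)]/\partial\theta^2\}}}}{\dfrac{-\frac{1}{n}\,\partial^2[\ln L(\theta)]/\partial\theta^2}{E\{-\partial^2[\ln f(X; \theta)]/\partial\theta^2\}}}. \qquad (3.4)$$ 

The numerator of (3.4) has an approximate N(0, 1) distribution; and those unstated mathematical conditions 
require, in some sense, for $-\frac{1}{n}\,\partial^2[\ln L(\theta)]/\partial\theta^2$ to converge to $E\{-\partial^2[\ln f(X; \theta)]/\partial\theta^2\}$. Accordingly, the ratios given 
in equation (3.4) must be approximately N(0, 1). That is, $\hat{\theta}$ has an approximate normal distribution with 
mean $\theta$ and standard deviation 

$$\frac{1}{\sqrt{-nE\{\partial^2[\ln f(X; \theta)]/\partial\theta^2\}}}.$$ 

7 This content is available online at <http://cnx.Org/content/ml3527/l.2/>. 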



3.7.1.1 



Example 3.6 

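With the underlying exponential p.d.f.

$$f(x;\theta) = \frac{1}{\theta}\,e^{-x/\theta}, \quad 0 < x < \infty, \quad \theta \in \Omega = \{\theta : 0 < \theta < \infty\},$$

$\bar X$ is the maximum likelihood estimator. Since

$$\ln f(x;\theta) = -\ln\theta - \frac{x}{\theta}, \qquad \frac{\partial[\ln f(x;\theta)]}{\partial\theta} = -\frac{1}{\theta} + \frac{x}{\theta^2}, \qquad \frac{\partial^2[\ln f(x;\theta)]}{\partial\theta^2} = \frac{1}{\theta^2} - \frac{2x}{\theta^3},$$

we have

$$-E\left[ \frac{1}{\theta^2} - \frac{2X}{\theta^3} \right] = -\frac{1}{\theta^2} + \frac{2\theta}{\theta^3} = \frac{1}{\theta^2},$$

because $E(X) = \theta$. That is, $\bar X$ has an approximate normal distribution with mean $\theta$ and standard
deviation $\theta/\sqrt{n}$. Thus the random interval $\bar X \pm 1.96\,(\theta/\sqrt{n})$ has an approximate probability of 0.95
for covering $\theta$. Substituting the observed $\bar x$ for $\theta$, as well as for $\bar X$, we say that $\bar x \pm 1.96\,\bar x/\sqrt{n}$ is
an approximate 95% confidence interval for $\theta$.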
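
The interval in Example 3.6 is easy to evaluate numerically. The following is a minimal sketch, assuming a small sample of waiting times that is purely hypothetical (it does not come from the text); the function name `exponential_mean_ci` is also just an illustrative choice.

```python
import math


def exponential_mean_ci(sample, z=1.96):
    """Approximate CI  x_bar +/- z * x_bar / sqrt(n)  for an exponential mean theta."""
    n = len(sample)
    x_bar = sum(sample) / n
    half_width = z * x_bar / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Hypothetical waiting times, used only to exercise the formula.
waiting_times = [1.2, 0.4, 3.1, 2.2, 0.9, 1.7, 0.3, 2.8, 1.1, 0.6]
lower, upper = exponential_mean_ci(waiting_times)
print(f"approximate 95% CI for theta: ({lower:.3f}, {upper:.3f})")
```

Because the approximation rests on the Central Limit Theorem, the interval is only trustworthy for reasonably large n; with ten observations, as here, it is an illustration rather than a recommendation.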

Example 3.7 

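The maximum likelihood estimator for $\lambda$ in

$$f(x;\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots; \quad \lambda \in \Omega = \{\lambda : 0 < \lambda < \infty\}$$

is $\hat\lambda = \bar X$. Now $\ln f(x;\lambda) = x\ln\lambda - \lambda - \ln x!$ and $\partial[\ln f(x;\lambda)]/\partial\lambda = x/\lambda - 1$ and $\partial^2[\ln f(x;\lambda)]/\partial\lambda^2 = -x/\lambda^2$. Thus

$-E(-X/\lambda^2) = \lambda/\lambda^2 = 1/\lambda$ and $\hat\lambda = \bar X$ has an approximate normal distribution with mean $\lambda$ and standard
deviation $\sqrt{\lambda/n}$. Finally $\bar x \pm 1.645\sqrt{\bar x/n}$ serves as an approximate 90% confidence interval for $\lambda$.
With the data from example(. . .) $\bar x$ = 2.225 and hence this interval is from 1.887 to 2.563.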



3.7.1.2 

It is interesting that there is another theorem which is somewhat related to the preceding result in that the
variance of $\hat\theta$ serves as a lower bound for the variance of every unbiased estimator of $\theta$. Thus we know that
if a certain unbiased estimator has a variance equal to that lower bound, we cannot find a better one and
hence it is the best in the sense of being the unbiased minimum variance estimator. This is called the
Rao-Cramér Inequality.

Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with p.d.f.

$$f(x;\theta), \quad \theta \in \Omega = \{\theta : c < \theta < d\},$$

where the support of X does not depend upon $\theta$, so that we can differentiate, with respect to $\theta$, under integral
signs like that in the following integral:

$$\int_{-\infty}^{\infty} f(x;\theta)\,dx = 1.$$

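If $Y = u(X_1, X_2, \ldots, X_n)$ is an unbiased estimator of $\theta$, then

$$\mathrm{Var}(Y) \ge \frac{1}{n\int_{-\infty}^{\infty} \{\partial \ln f(x;\theta)/\partial\theta\}^2 f(x;\theta)\,dx} = \frac{-1}{n\int_{-\infty}^{\infty} [\partial^2 \ln f(x;\theta)/\partial\theta^2]\, f(x;\theta)\,dx}.$$

Note that the two integrals in the respective denominators are the expectations

$$E\left\{ \left[ \frac{\partial \ln f(X;\theta)}{\partial\theta} \right]^2 \right\} \quad \text{and} \quad E\left[ \frac{\partial^2 \ln f(X;\theta)}{\partial\theta^2} \right];$$

sometimes one is easier to compute than the other.

Note that the lower bounds for two distributions, exponential and Poisson, were computed above. Those
respective lower bounds were $\theta^2/n$ and $\lambda/n$. Since in each case the variance of $\bar X$ equals the lower bound,
$\bar X$ is the unbiased minimum variance estimator.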
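
As a numerical sanity check on the Poisson case, the expectation $E\{[\partial \ln f(X;\lambda)/\partial\lambda]^2\}$ can be summed directly. This is a sketch only: the function names, the truncation of the infinite sum at 100 terms, and the values $\lambda = 2$ and n = 40 are assumptions made for illustration.

```python
import math


def poisson_information(lam, terms=100):
    """Sum (x/lam - 1)^2 * f(x; lam) over x = 0, 1, ..., terms - 1."""
    total = 0.0
    pmf = math.exp(-lam)          # f(0; lam)
    for x in range(terms):
        total += (x / lam - 1.0) ** 2 * pmf
        pmf *= lam / (x + 1)      # f(x + 1; lam) obtained from f(x; lam)
    return total


def rao_cramer_bound(information, n):
    """Lower bound 1 / (n * information) for the variance of an unbiased estimator."""
    return 1.0 / (n * information)


lam, n = 2.0, 40                  # illustrative values, not from the text
info = poisson_information(lam)
print(info, rao_cramer_bound(info, n), lam / n)
```

The computed bound should agree with $\lambda/n$, the variance of $\bar X$ for a Poisson sample, which is the sense in which $\bar X$ attains the bound.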

3.7.1.3 

Example 3.8 

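The sample arises from a distribution with p.d.f.

$$f(x;\theta) = \theta x^{\theta-1}, \quad 0 < x < 1, \quad \theta \in \Omega = \{\theta : 0 < \theta < \infty\}.$$

We have

$$\ln f(x;\theta) = \ln\theta + (\theta - 1)\ln x, \qquad \frac{\partial \ln f(x;\theta)}{\partial\theta} = \frac{1}{\theta} + \ln x,$$

and

$$\frac{\partial^2 \ln f(x;\theta)}{\partial\theta^2} = -\frac{1}{\theta^2}.$$

Since $E(-1/\theta^2) = -1/\theta^2$, the lower bound of the variance of every unbiased estimator of $\theta$ is
$\theta^2/n$. Moreover, the maximum likelihood estimator

$$\hat\theta = -n \Big/ \ln \prod_{i=1}^{n} X_i$$

has an approximate normal distribution with mean $\theta$ and variance $\theta^2/n$. Thus, in a limiting sense,
$\hat\theta$ is the unbiased minimum variance estimator of $\theta$.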



3.7.1.3.1 

To measure the value of estimators, their variances are compared to the Rao-Cramér lower bound. The ratio
of the Rao-Cramér lower bound to the actual variance of any unbiased estimator is called the efficiency of
that estimator. An estimator with an efficiency of 50% requires 1/0.5 = 2 times as many sample observations
to do as well in estimation as the unbiased minimum variance estimator (the
100% efficient estimator).

Chapter 4

Tests of Statistical Hypotheses 



4.1 TEST ABOUT PROPORTIONS 1 
4.1.1 TEST ABOUT PROPORTIONS 

Tests of statistical hypotheses are a very important topic; let us introduce it through an illustration.

4.1.1.1 

Suppose a manufacturer of a certain printed circuit observes that about p=0.05 of the circuits fail. An
engineer and a statistician working together suggest some changes that might improve the design of the product.
To test this new procedure, it was agreed that n=200 circuits would be produced using the proposed method
and then checked. Let Y equal the number of these 200 circuits that fail. Clearly, if the number of failures, Y,
is such that Y/200 is about 0.05, then it seems that the new procedure has not resulted in an improvement.
On the other hand, if Y is small so that Y/200 is about 0.01 or 0.02, we might believe that the new method
is better than the old one. Conversely, if Y/200 is 0.08 or 0.09, the proposed method has perhaps
caused a greater proportion of failures. What is needed is a formal rule that tells when to accept
the new procedure as an improvement. For example, we could accept the new procedure as an improvement
if $Y \le 5$ or, equivalently, $Y/n \le 0.025$. We do note, however, that the probability of failure could still be about p=0.05
even with the new procedure, and yet we could observe 5 or fewer failures in n=200 trials.

That is, we would accept the new method as being an improvement when, in fact, it was not. This 
decision is a mistake which we call a Type I error. On the other hand, the new procedure might actually 
improve the product so that p is much smaller, say p=0.02, and yet we could observe y=7 failures so that 
y/200=0.035. Thus we would not accept the new method as resulting in an improvement when in fact it 
had. This decision would also be a mistake which we call a Type II error. 

If we believe these trials, using the new procedure, are independent and have about the same probability
of failure on each trial, then Y is binomial b(200,p). We wish to make a statistical inference about p using
the unbiased estimator $\hat p = Y/200$. We could also construct a confidence interval, say one that has 95% confidence,
obtaining

$$\hat p \pm 1.96\sqrt{\frac{\hat p\,(1 - \hat p)}{200}}.$$



This inference is very appropriate and many statisticians simply do this. If the limits of this confidence
interval contain 0.05, they would not say the new procedure is necessarily better, at least until more data
are taken. If, on the other hand, the upper limit of this confidence interval is less than 0.05, then they feel
95% confident that the true p is now less than 0.05. Here, in this illustration, we are testing whether or not
the probability of failure has decreased from 0.05 when the new manufacturing procedure is used.

1 This content is available online at <http://cnx.org/content/m13525/1.2/>.

The no-change hypothesis, $H_0: p = 0.05$, is called the null hypothesis. Since $H_0: p = 0.05$ completely
specifies the distribution, it is called a simple hypothesis; thus $H_0: p = 0.05$ is a simple null hypothesis.

The research worker's hypothesis $H_1: p < 0.05$ is called the alternative hypothesis. Since $H_1: p < 0.05$
does not completely specify the distribution, it is a composite hypothesis because it is composed of
many simple hypotheses.

The rule of rejecting $H_0$ and accepting $H_1$ if $Y \le 5$, and otherwise accepting $H_0$, is called a test of a
statistical hypothesis.

It is clearly seen that two types of errors can occur:

• Type I error: Rejecting $H_0$ and accepting $H_1$, when $H_0$ is true;

• Type II error: Accepting $H_0$ when $H_1$ is true, that is, when $H_0$ is false.

Since, in the example above, we make a Type I error if $Y \le 5$ when in fact p=0.05, we can calculate the
probability of this error, which we denote by $\alpha$ and call the significance level of the test. Under our
assumptions, it is

$$\alpha = P(Y \le 5;\ p = 0.05) = \sum_{y=0}^{5} \binom{200}{y} (0.05)^y (0.95)^{200-y}.$$

Since n is rather large and p is small, these binomial probabilities can be approximated extremely well
by Poisson probabilities with $\lambda = 200(0.05) = 10$. That is, from the Poisson table, the probability of the
Type I error is

$$\alpha \approx \sum_{y=0}^{5} \frac{10^y e^{-10}}{y!} = 0.067.$$

Thus, the approximate significance level of this test is $\alpha = 0.067$. This value is reasonably small. However,
what about the probability of a Type II error in case p has been improved to 0.02, say? This error occurs if
Y > 5 when, in fact, p=0.02; hence its probability, denoted by $\beta$, is

$$\beta = P(Y > 5;\ p = 0.02) = \sum_{y=6}^{200} \binom{200}{y} (0.02)^y (0.98)^{200-y}.$$

Again we use the Poisson approximation, here $\lambda = 200(0.02) = 4$, to obtain

$$\beta \approx 1 - \sum_{y=0}^{5} \frac{4^y e^{-4}}{y!} = 1 - 0.785 = 0.215.$$
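
The two error probabilities can be recomputed without tables. The sketch below evaluates both the exact binomial sums and their Poisson approximations for the rule "reject when $Y \le 5$" with n = 200; the helper names `binomial_cdf` and `poisson_cdf` are illustrative choices, not part of the text.

```python
import math


def binomial_cdf(k, n, p):
    """P(Y <= k) for Y ~ b(n, p)."""
    return sum(math.comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(k + 1))


def poisson_cdf(k, lam):
    """P(Y <= k) for Y ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam**y / math.factorial(y) for y in range(k + 1))


n, cutoff = 200, 5

# Type I error: reject H0 (Y <= 5) although p = 0.05.
alpha_exact = binomial_cdf(cutoff, n, 0.05)
alpha_poisson = poisson_cdf(cutoff, n * 0.05)

# Type II error: fail to reject (Y > 5) although p = 0.02.
beta_exact = 1 - binomial_cdf(cutoff, n, 0.02)
beta_poisson = 1 - poisson_cdf(cutoff, n * 0.02)

print(f"alpha: exact {alpha_exact:.4f}, Poisson {alpha_poisson:.4f}")
print(f"beta:  exact {beta_exact:.4f}, Poisson {beta_poisson:.4f}")
```

The Poisson figures reproduce the 0.067 and 0.215 quoted above; printing the exact binomial values alongside shows how close the approximation is for this n and p.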

The engineers and the statisticians who created this new procedure probably are not too pleased with
this answer. That is, they note that if their new procedure of manufacturing circuits has actually decreased
the probability of failure to 0.02 from 0.05 (a big improvement), there is still a good chance, 0.215, that
$H_0$: p=0.05 is accepted and their improvement rejected. Thus, this test of $H_0$: p=0.05 against $H_1$: p=0.02 is
unsatisfactory. Without worrying more about the probability of the Type II error here, we now present
a frequently used procedure for testing $H_0: p = p_0$, where $p_0$ is some specified probability of success. This
test is based upon the fact that the number of successes, Y, in n independent Bernoulli trials is such that
Y/n has an approximate normal distribution, $N[p_0,\ p_0(1-p_0)/n]$, provided $H_0: p = p_0$ is true and n is large.
Suppose the alternative hypothesis is $H_1: p > p_0$; that is, it has been hypothesized by a research worker
that something has been done to increase the probability of success. Consider the test of $H_0: p = p_0$ against
$H_1: p > p_0$ that rejects $H_0$ and accepts $H_1$ if and only if

$$Z = \frac{Y/n - p_0}{\sqrt{p_0(1-p_0)/n}} \ge z_\alpha.$$

That is, if Y/n exceeds $p_0$ by $z_\alpha$ standard deviations of Y/n, we reject $H_0$ and accept the hypothesis
$H_1: p > p_0$. Since, under $H_0$, Z is approximately N(0,1), the approximate probability of this occurring when
$H_0: p = p_0$ is true is $\alpha$. That is, the significance level of that test is approximately $\alpha$. If the alternative is
$H_1: p < p_0$ instead of $H_1: p > p_0$, then the appropriate $\alpha$-level test is given by $Z \le -z_\alpha$. That is, if Y/n is
smaller than $p_0$ by $z_\alpha$ standard deviations of Y/n, we accept $H_1: p < p_0$.
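
A minimal sketch of this one-sided test is given below. The function name `proportion_z_test`, its `alternative` argument, and the observed count y = 4 are assumptions introduced for illustration; only the statistic Z and the critical values $\pm z_\alpha$ come from the text.

```python
import math
from statistics import NormalDist


def proportion_z_test(y, n, p0, alpha=0.05, alternative="greater"):
    """Return (z, reject) for H0: p = p0 against a one-sided alternative."""
    z = (y / n - p0) / math.sqrt(p0 * (1 - p0) / n)
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # upper-alpha point of N(0, 1)
    if alternative == "greater":
        reject = z >= z_alpha
    else:                                       # alternative "less"
        reject = z <= -z_alpha
    return z, reject


# Hypothetical outcome: 4 failures among 200 circuits, testing p = 0.05 against p < 0.05.
z, reject = proportion_z_test(4, 200, 0.05, alpha=0.05, alternative="less")
print(f"z = {z:.3f}, reject H0: {reject}")
```

The normal approximation is only reasonable when n is large; for small expected counts the exact binomial or Poisson calculation is the safer choice.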

In general, without changing the sample size or the type of the test of the hypothesis, a decrease in $\alpha$
causes an increase in $\beta$, and a decrease in $\beta$ causes an increase in $\alpha$. Both probabilities $\alpha$ and $\beta$ of the two
types of errors can be decreased only by increasing the sample size or, in some way, constructing a better
test of the hypothesis.

4.1.1.1.1 EXAMPLE 

If n=100 and we desire a test with significance level $\alpha$=0.05, then $\alpha = P(\bar X \ge c;\ \mu = 60) = 0.05$ means,
since $\bar X$ is $N(\mu, 100/100 = 1)$,

$$P\left( \frac{\bar X - 60}{1} \ge \frac{c - 60}{1};\ \mu = 60 \right) = 0.05$$

and $c - 60 = 1.645$. Thus c=61.645. The power function is

$$K(\mu) = P(\bar X \ge 61.645;\ \mu) = P\left( \frac{\bar X - \mu}{1} \ge \frac{61.645 - \mu}{1};\ \mu \right) = 1 - \Phi(61.645 - \mu).$$

In particular, this means that $\beta$ at $\mu$=65 is

$$\beta = 1 - K(65) = \Phi(61.645 - 65) = \Phi(-3.355) \approx 0;$$

so, with n=100, both $\alpha$ and $\beta$ have decreased from their respective original values of 0.1587 and 0.0668 when
n=25. Rather than guess at the value of n, an ideal power function determines the sample size. Let us use
a critical region of the form $\bar x \ge c$. Further, suppose that we want $\alpha$=0.025 and, when $\mu$=65, $\beta$=0.05. Thus,
since $\bar X$ is $N(\mu, 100/n)$,

$$0.025 = P(\bar X \ge c;\ \mu = 60) = 1 - \Phi\left( \frac{c - 60}{10/\sqrt{n}} \right)$$

and

$$0.05 = 1 - P(\bar X \ge c;\ \mu = 65) = \Phi\left( \frac{c - 65}{10/\sqrt{n}} \right).$$

That is, $\dfrac{c - 60}{10/\sqrt{n}} = 1.96$ and $\dfrac{c - 65}{10/\sqrt{n}} = -1.645$.

Solving these equations simultaneously for c and $10/\sqrt{n}$, we obtain

$$c = 60 + 1.96\,\frac{5}{3.605} = 62.718; \qquad \frac{10}{\sqrt{n}} = \frac{5}{3.605}.$$

Thus, $\sqrt{n} = 7.21$ and n = 51.98. Since n must be an integer, we would use n=52 and obtain $\alpha$=0.025
and $\beta$=0.05, approximately.

4.1.1.1.2

For a number of years there has been another value associated with a statistical test, and most statistical
computer programs automatically print this out; it is called the probability value or, for brevity, p-value.
The p-value associated with a test is the probability that we obtain the observed value of the test statistic
or a value that is more extreme in the direction of the alternative hypothesis, calculated when $H_0$ is true.
Rather than select the critical region ahead of time, the p-value of a test can be reported and the reader
then makes a decision.

4.1.1.1.2.1 

Say we are testing $H_0: \mu = 60$ against $H_1: \mu > 60$ with a sample mean $\bar X$ based on n=52 observations. Suppose
that we obtain the observed sample mean of $\bar x = 62.75$. If we compute the probability of obtaining an $\bar x$ of
that value of 62.75 or greater when $\mu$=60, then we obtain the p-value associated with $\bar x = 62.75$. That is,

$$p\text{-value} = P(\bar X \ge 62.75;\ \mu = 60) = P\left( \frac{\bar X - 60}{10/\sqrt{52}} \ge \frac{62.75 - 60}{10/\sqrt{52}};\ \mu = 60 \right) = 1 - \Phi\left( \frac{62.75 - 60}{10/\sqrt{52}} \right) = 1 - \Phi(1.983) = 0.0237.$$

If this p-value is small, we tend to reject the hypothesis $H_0: \mu = 60$. For example, rejection of $H_0: \mu = 60$
if the p-value is less than or equal to 0.025 is exactly the same as rejection if $\bar x \ge 62.718$. That is, $\bar x = 62.718$
has a p-value of 0.025. To help keep the definition of p-value in mind, we note that it can be thought of
as that tail-end probability, under $H_0$, of the distribution of the statistic, here $\bar X$, beyond the observed
value of the statistic. See Figure 1 (Figure 4.1) for the p-value associated with $\bar x = 62.75$.
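
The sample-size calculation and the p-value above can both be reproduced with the standard library. This is a sketch under the stated setup ($\sigma = 10$, $H_0: \mu = 60$, alternative value $\mu = 65$); the variable names are illustrative, and `NormalDist` supplies the normal quantiles that the text reads from a table.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()
sigma, mu0, mu1 = 10.0, 60.0, 65.0
alpha, beta = 0.025, 0.05

z_alpha = std_normal.inv_cdf(1 - alpha)      # about 1.96
z_beta = std_normal.inv_cdf(1 - beta)        # about 1.645

# Solve (c - mu0)/se = z_alpha and (c - mu1)/se = -z_beta for se = sigma/sqrt(n) and c.
se = (mu1 - mu0) / (z_alpha + z_beta)
n_exact = (sigma / se) ** 2
c = mu0 + z_alpha * se
n = math.ceil(n_exact)                       # round up to an integer sample size

# p-value of the observed mean 62.75 with n = 52 observations.
x_bar = 62.75
p_value = 1 - std_normal.cdf((x_bar - mu0) / (sigma / math.sqrt(n)))

print(f"n = {n_exact:.2f} -> {n}, c = {c:.3f}, p-value = {p_value:.4f}")
```

Rounding n up rather than to the nearest integer is a conservative design choice: it keeps both error probabilities at or below their targets.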
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Figure 4.1: The p-value associated with x — 62.75. 



Example 4.1 

Suppose that in the past, a golfer's scores have been (approximately) normally distributed with
mean $\mu$=90 and $\sigma^2$=9. After taking some lessons, the golfer has reason to believe that the mean $\mu$
has decreased. (We assume that $\sigma^2$ is still about 9.) To test the null hypothesis $H_0: \mu = 90$ against
the alternative hypothesis $H_1: \mu < 90$, the golfer plays 16 games, computing the sample mean $\bar x$. If
$\bar x$ is small, say $\bar x \le c$, then $H_0$ is rejected and $H_1$ accepted; that is, it seems as if the mean $\mu$ has
actually decreased after the lessons. If c=88.5, then the power function of the test is

$$K(\mu) = P(\bar X \le 88.5;\ \mu) = P\left( \frac{\bar X - \mu}{3/4} \le \frac{88.5 - \mu}{3/4};\ \mu \right) = \Phi\left( \frac{88.5 - \mu}{3/4} \right),$$

because 9/16 is the variance of $\bar X$. In particular,

$$\alpha = K(90) = \Phi(-2) = 1 - 0.9772 = 0.0228.$$

If, in fact, the true mean is equal to $\mu$=88 after the lessons, the power is $K(88) = \Phi(2/3) = 0.7475$.
If $\mu$=87, then $K(87) = \Phi(2) = 0.9772$. An observed sample mean of $\bar x = 88.25$ has a

$$p\text{-value} = P(\bar X \le 88.25;\ \mu = 90) = \Phi\left( \frac{88.25 - 90}{3/4} \right) = \Phi\left( -\frac{7}{3} \right) = 0.0098,$$

and this would lead to a rejection at $\alpha$=0.0228 (or even $\alpha$=0.01).
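
The power function of Example 4.1 can be tabulated directly. The sketch below assumes only what the example states (critical value c = 88.5 and standard deviation 3/4 for the sample mean); the function name `power` is an illustrative choice.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
SD_OF_MEAN = 3 / 4          # sqrt(9 / 16), the standard deviation of the sample mean


def power(mu, c=88.5, sd=SD_OF_MEAN):
    """K(mu) = P(X_bar <= c; mu) for the lower-tailed test that rejects when x_bar <= c."""
    return STD_NORMAL.cdf((c - mu) / sd)


for mu in (90, 89, 88, 87):
    print(f"K({mu}) = {power(mu):.4f}")

# p-value of an observed mean of 88.25, computed under H0: mu = 90.
p_value = STD_NORMAL.cdf((88.25 - 90) / SD_OF_MEAN)
print(f"p-value = {p_value:.4f}")
```

Evaluating K at $\mu$ = 90 gives the significance level, while values of $\mu$ below 90 give the power against those alternatives.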

4.2 TESTS ABOUT ONE MEAN AND ONE VARIANCE 2 

4.2.1 TESTS ABOUT ONE MEAN AND ONE VARIANCE 

In the previous paragraphs it was assumed that we were sampling from a normal distribution and the variance
was known. The null hypothesis was generally of the form $H_0: \mu = \mu_0$.

There are essentially three possibilities for the alternative hypothesis, namely that

1. $\mu$ has increased, $H_1: \mu > \mu_0$;

2. $\mu$ has decreased, $H_1: \mu < \mu_0$;

3. $\mu$ has changed, but it is not known if it has increased or decreased, which leads to a
two-sided alternative hypothesis, $H_1: \mu \ne \mu_0$.

To test $H_0: \mu = \mu_0$ against one of these three alternative hypotheses, a random sample is taken from the
distribution, and an observed sample mean, $\bar x$, that is close to $\mu_0$ supports $H_0$. The closeness of $\bar x$ to $\mu_0$ is
measured in terms of standard deviations of $\bar X$, $\sigma/\sqrt{n}$, which is sometimes called the standard error of the
mean. Thus the statistic could be defined by

$$Z = \frac{\bar X - \mu_0}{\sqrt{\sigma^2/n}} = \frac{\bar X - \mu_0}{\sigma/\sqrt{n}},$$

and the critical regions, at a significance level $\alpha$, for the three respective alternative hypotheses would be:

1. $z \ge z_\alpha$
2. $z \le -z_\alpha$
3. $|z| \ge z_{\alpha/2}$

In terms of $\bar x$ these three critical regions become

1. $\bar x \ge \mu_0 + z_\alpha\,\sigma/\sqrt{n}$,
2. $\bar x \le \mu_0 - z_\alpha\,\sigma/\sqrt{n}$,
3. $|\bar x - \mu_0| \ge z_{\alpha/2}\,\sigma/\sqrt{n}$.



4.2.1.1 

These tests and critical regions are summarized in TABLE 1 (Table 4.1: TABLE 1). The underlying
assumption is that the distribution is $N(\mu, \sigma^2)$ and $\sigma^2$ is known. Thus far we have assumed that the
variance $\sigma^2$ was known. We now take a more realistic position and assume that the variance is unknown.
Suppose our null hypothesis is $H_0: \mu = \mu_0$ and the two-sided alternative hypothesis is $H_1: \mu \ne \mu_0$. If a random
sample $X_1, X_2, \ldots, X_n$ is taken from a normal distribution $N(\mu, \sigma^2)$, let us recall that a confidence interval for
$\mu$ was based on

$$T = \frac{\bar X - \mu}{\sqrt{S^2/n}} = \frac{\bar X - \mu}{S/\sqrt{n}}.$$

TABLE 1

H_0                H_1                  Critical Region
$\mu = \mu_0$      $\mu > \mu_0$        $z \ge z_\alpha$ or $\bar x \ge \mu_0 + z_\alpha\,\sigma/\sqrt{n}$
$\mu = \mu_0$      $\mu < \mu_0$        $z \le -z_\alpha$ or $\bar x \le \mu_0 - z_\alpha\,\sigma/\sqrt{n}$
$\mu = \mu_0$      $\mu \ne \mu_0$      $|z| \ge z_{\alpha/2}$ or $|\bar x - \mu_0| \ge z_{\alpha/2}\,\sigma/\sqrt{n}$

Table 4.1

2 This content is available online at <http://cnx.org/content/m13526/1.3/>.



This suggests that T might be a good statistic to use for the test of $H_0: \mu = \mu_0$, with $\mu$ replaced by
$\mu_0$. In addition, it is the natural statistic to use if we replace $\sigma^2/n$ by its unbiased estimator $S^2/n$ in
$(\bar X - \mu_0)/\sqrt{\sigma^2/n}$. If $\mu = \mu_0$, we know that T has a t distribution with n-1 degrees of
freedom. Thus, with $\mu = \mu_0$,

$$P\left[\,|T| \ge t_{\alpha/2}(n-1)\,\right] = P\left[ \frac{|\bar X - \mu_0|}{S/\sqrt{n}} \ge t_{\alpha/2}(n-1) \right] = \alpha.$$

Accordingly, if $\bar x$ and s are the sample mean and the sample standard deviation, the rule that rejects
$H_0: \mu = \mu_0$ if and only if

$$|t| = \frac{|\bar x - \mu_0|}{s/\sqrt{n}} \ge t_{\alpha/2}(n-1)$$

provides the test of the hypothesis with significance level $\alpha$. It should be noted that this rule is equivalent
to rejecting $H_0: \mu = \mu_0$ if $\mu_0$ is not in the open $100(1-\alpha)\%$ confidence interval

$$\left( \bar x - t_{\alpha/2}(n-1)\,s/\sqrt{n},\ \ \bar x + t_{\alpha/2}(n-1)\,s/\sqrt{n} \right).$$

Table 2 (Table 4.2: TABLE 2) summarizes tests of hypotheses for a single mean, along with the
three possible alternative hypotheses, when the underlying distribution is $N(\mu, \sigma^2)$, $\sigma^2$ is unknown,
$t = (\bar x - \mu_0)/(s/\sqrt{n})$ and $n \le 31$. If n>31, use table 1 (Table 4.1: TABLE 1) for approximate tests
with $\sigma$ replaced by s.

TABLE 2

H_0                H_1                  Critical Region
$\mu = \mu_0$      $\mu > \mu_0$        $t \ge t_\alpha(n-1)$ or $\bar x \ge \mu_0 + t_\alpha(n-1)\,s/\sqrt{n}$
$\mu = \mu_0$      $\mu < \mu_0$        $t \le -t_\alpha(n-1)$ or $\bar x \le \mu_0 - t_\alpha(n-1)\,s/\sqrt{n}$
$\mu = \mu_0$      $\mu \ne \mu_0$      $|t| \ge t_{\alpha/2}(n-1)$ or $|\bar x - \mu_0| \ge t_{\alpha/2}(n-1)\,s/\sqrt{n}$

Table 4.2

Example 4.2 

Let X (in millimeters) equal the growth in 15 days of a tumor induced in a mouse. Assume that
the distribution of X is $N(\mu, \sigma^2)$. We shall test the null hypothesis $H_0: \mu = \mu_0 = 4.0$ millimeters
against the two-sided alternative hypothesis $H_1: \mu \ne 4.0$. If we use n=9 observations and a
significance level of $\alpha$=0.10, the critical region is

$$|t| = \frac{|\bar x - 4.0|}{s/\sqrt{9}} \ge t_{\alpha/2}(8) = t_{0.05}(8) = 1.860.$$

If we are given that n=9, $\bar x$=4.3, and s=1.2, we see that

$$t = \frac{4.3 - 4.0}{1.2/\sqrt{9}} = \frac{0.3}{0.4} = 0.75.$$

Thus $|t| = |0.75| < 1.860$ and we accept (do not reject) $H_0: \mu = 4.0$ at the $\alpha$=10% significance
level. See Figure 1 (Figure 4.2).

Figure 4.2: Rejection region at the $\alpha$ = 10% significance level.

note: In discussing the test of a statistical hypothesis, the word accept might better be replaced
by do not reject. That is, in Example 1 (Example 4.2), $\bar x$ is close enough to 4.0 so that we accept
$\mu$=4.0, but we do not want that acceptance to imply that $\mu$ is actually equal to 4.0. We want to say
that the data do not deviate enough from $\mu$=4.0 for us to reject that hypothesis; that is, we do not
reject $\mu$=4.0 with these observed data. With this understanding, one sometimes uses accept and
sometimes fail to reject or do not reject, the null hypothesis.
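
The arithmetic of Example 4.2 can be sketched as follows. The standard library has no t quantile function, so the critical value 1.860 is copied from the example rather than computed; the function name `one_sample_t` is an illustrative choice.

```python
import math


def one_sample_t(x_bar, s, n, mu0):
    """t = (x_bar - mu0) / (s / sqrt(n)) for testing H0: mu = mu0."""
    return (x_bar - mu0) / (s / math.sqrt(n))


t = one_sample_t(x_bar=4.3, s=1.2, n=9, mu0=4.0)
critical = 1.860                     # t_{0.05}(8), taken from the example
reject = abs(t) >= critical          # two-sided test at alpha = 0.10

print(f"t = {t:.2f}, reject H0: {reject}")
```

In line with the note above, a `False` result here means only that the data do not contradict $\mu$ = 4.0, not that $\mu$ has been shown to equal 4.0.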



4.2.1.2 

The following example illustrates the use of the t-statistic with a one-sided alternative hypothesis.
Example 4.3

In attempting to control the strength of the wastes discharged into a nearby river, a paper firm
has taken a number of measures. Members of the firm believe that they have reduced the
oxygen-consuming power of their wastes from a previous mean $\mu$ of 500. They plan to test $H_0: \mu = 500$
against $H_1: \mu < 500$, using readings taken on n=25 consecutive days. If these 25 values can be
treated as a random sample, then the critical region, for a significance level of $\alpha$=0.01, is

$$t = \frac{\bar x - 500}{s/\sqrt{25}} \le -t_{0.01}(24) = -2.492.$$

The observed values of the sample mean and sample standard deviation were $\bar x$=308.8 and
s=115.15. Since

$$t = \frac{308.8 - 500}{115.15/\sqrt{25}} = -8.30 < -2.492,$$

we clearly reject the null hypothesis and accept $H_1: \mu < 500$. It should be noted, however,
that although an improvement has been made, there still might exist the question of whether the
improvement is adequate. The 95% confidence interval 308.8 ± 2.064 (115.15/5) or [261.27, 356.33]
for $\mu$ might help the company answer that question.
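
Both the test statistic and the confidence interval of Example 4.3 follow from the same standard error. In the sketch below the table values $t_{0.01}(24) = 2.492$ and $t_{0.025}(24) = 2.064$ are copied from the example; the function name `t_interval` is an illustrative choice.

```python
import math


def t_interval(x_bar, s, n, t_quantile):
    """Two-sided interval x_bar +/- t_quantile * s / sqrt(n)."""
    half_width = t_quantile * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


x_bar, s, n, mu0 = 308.8, 115.15, 25, 500.0
standard_error = s / math.sqrt(n)

t = (x_bar - mu0) / standard_error
reject = t <= -2.492                          # -t_{0.01}(24), from the example
lower, upper = t_interval(x_bar, s, n, 2.064)  # t_{0.025}(24), from the example

print(f"t = {t:.2f}, reject H0: {reject}")
print(f"95% CI for mu: [{lower:.2f}, {upper:.2f}]")
```

Reporting the interval alongside the test is useful here because the question of practical adequacy depends on how far below 500 the mean plausibly lies, not just on whether it is below 500.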



4.2.2 

4.3 TEST OF THE EQUALITY OF TWO INDEPENDENT NORMAL DISTRIBUTIONS 3

4.3.1 TEST OF THE EQUALITY OF TWO INDEPENDENT NORMAL DISTRIBUTIONS

Let X and Y have independent normal distributions $N(\mu_X, \sigma_X^2)$ and $N(\mu_Y, \sigma_Y^2)$, respectively. There are times
when we are interested in testing whether the distributions of X and Y are the same. So if the assumption
of normality is valid, we would be interested in testing whether the two variances are equal and whether the
two means are equal.

Let us first consider a test of the equality of the two means. When X and Y are independent and normally
distributed, we can test hypotheses about their means using the same t-statistic that was used previously.
Recall that the t-statistic used for constructing the confidence interval assumed that the variances of X and
Y are equal. That is why we shall later consider a test for the equality of two variances.

Let us start with an example and then give a table that lists some hypotheses and critical regions.

3 This content is available online at <http://cnx.org/content/m13532/1.2/>.

4.3.1.1



Example 4.4 

A botanist is interested in comparing the growth response of dwarf pea stems to two different
levels of the hormone indoleacetic acid (IAA). Using 16-day-old pea plants, the botanist obtains
5-millimeter sections and floats these sections on solutions with different hormone concentrations to observe the
effect of the hormone on the growth of the pea stem.

Let X and Y denote, respectively, the independent growths that can be attributed to the 
hormone during the first 26 hours after sectioning for (0.5) (10) and (10) _ levels of concentration 
of IAA. The botanist would like to test the null hypothesis $H_0: \mu_X - \mu_Y = 0$ against the alternative
hypothesis $H_1: \mu_X - \mu_Y < 0$. If we can assume X and Y are independent and normally distributed
with common variance, respective random samples of size n and m give a test based on the statistic

$$T = \frac{\bar X - \bar Y}{\sqrt{\left\{ \left[ (n-1)S_X^2 + (m-1)S_Y^2 \right] / (n+m-2) \right\} (1/n + 1/m)}} = \frac{\bar X - \bar Y}{S_P\sqrt{1/n + 1/m}},$$

where

$$S_P = \sqrt{\frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}}.$$

T has a t distribution with r = n + m - 2 degrees of freedom when $H_0$ is true and the variances
are (approximately) equal. The hypothesis $H_0$ will be rejected in favor of $H_1$ if the observed value
of T is less than $-t_\alpha(n+m-2)$.



4.3.1.2 



Example 4.5 

In the example 1 (Example 4.4), the botanist measured the growths of pea stem segments, in
millimeters, for n=11 observations of X given in the Table 1:

Table 1

0.8   1.8   1.0   0.1   0.9   1.7   1.0   1.4   0.9   1.2   0.5

Table 4.3

and m=13 observations of Y given in the Table 2:

Table 2

1.0   0.8   1.6   2.6   1.3   1.1   2.4   1.8   2.5   1.4   1.9   2.0   1.2

Table 4.4

For these data, $\bar x = 1.03$, $s_x^2 = 0.24$, $\bar y = 1.66$, and $s_y^2 = 0.35$. The critical region for testing
$H_0: \mu_X - \mu_Y = 0$ against $H_1: \mu_X - \mu_Y < 0$ is $t \le -t_{0.05}(22) = -1.717$. Since

$$t = \frac{1.03 - 1.66}{\sqrt{\left\{ [10(0.24) + 12(0.35)] / (11 + 13 - 2) \right\} (1/11 + 1/13)}} = -2.81 < -1.717,$$

$H_0$ is clearly rejected at the $\alpha$=0.05 significance level.

4.3.1.3

note: an approximate p-value of this test is 0.005 because $-t_{0.005}(22) = -2.819$. Also, the sample
variances do not differ too much; thus most statisticians would use this two-sample t-test.
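
The pooled two-sample statistic can be recomputed from the raw data in Tables 4.3 and 4.4. This is a sketch; the function name `pooled_t` is an illustrative choice, and the critical value 1.717 is the table value $t_{0.05}(22)$ quoted in the example.

```python
import math
from statistics import mean, variance

x = [0.8, 1.8, 1.0, 0.1, 0.9, 1.7, 1.0, 1.4, 0.9, 1.2, 0.5]
y = [1.0, 0.8, 1.6, 2.6, 1.3, 1.1, 2.4, 1.8, 2.5, 1.4, 1.9, 2.0, 1.2]


def pooled_t(x, y):
    """Two-sample t statistic assuming a common variance."""
    n, m = len(x), len(y)
    pooled_var = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / n + 1 / m))


t = pooled_t(x, y)
reject = t <= -1.717                 # -t_{0.05}(22), from the example

print(f"x_bar = {mean(x):.2f}, s_x^2 = {variance(x):.2f}")
print(f"y_bar = {mean(y):.2f}, s_y^2 = {variance(y):.2f}")
print(f"t = {t:.2f}, reject H0: {reject}")
```

`statistics.variance` uses the n - 1 divisor, which matches the sample variances $S_X^2$ and $S_Y^2$ in the definition of T.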



4.4 BEST CRITICAL REGIONS 4 
4.4.1 BEST CRITICAL REGIONS 

In this section, let us consider the properties a satisfactory test should possess.

4.4.1.1 

Definition 4.1: 

1. Consider the test of the simple null hypothesis $H_0: \theta = \theta_0$ against the simple alternative
hypothesis $H_1: \theta = \theta_1$.

2. Let C be a critical region of size $\alpha$; that is, $\alpha = P(C; \theta_0)$. Then C is a best critical region of
size $\alpha$ if, for every other critical region D of size $\alpha = P(D; \theta_0)$, we have that

$$P(C; \theta_1) \ge P(D; \theta_1).$$

That is, when $H_1: \theta = \theta_1$ is true, the probability of rejecting $H_0: \theta = \theta_0$ using the critical region C is
at least as great as the corresponding probability using any other critical region D of size $\alpha$.

Thus a best critical region of size $\alpha$ is the critical region that has the greatest power among all critical
regions of size $\alpha$. The Neyman-Pearson lemma gives sufficient conditions for a
best critical region of size $\alpha$.

4.4.1.2 

Theorem 4.1: Neyman-Pearson Lemma 

Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from a distribution with p.d.f. $f(x;\theta)$, where $\theta_0$
and $\theta_1$ are two possible values of $\theta$.

Denote the joint p.d.f. of $X_1, X_2, \ldots, X_n$ by the likelihood function

$$L(\theta) = L(\theta; x_1, x_2, \ldots, x_n) = f(x_1;\theta)\, f(x_2;\theta) \cdots f(x_n;\theta).$$

If there exist a positive constant k and a subset C of the sample space such that

1. $P[(X_1, X_2, \ldots, X_n) \in C;\ \theta_0] = \alpha$,
2. $\dfrac{L(\theta_0)}{L(\theta_1)} \le k$ for $(x_1, x_2, \ldots, x_n) \in C$,
3. $\dfrac{L(\theta_0)}{L(\theta_1)} \ge k$ for $(x_1, x_2, \ldots, x_n) \in C'$,

then C is a best critical region of size $\alpha$ for testing the simple null hypothesis $H_0: \theta = \theta_0$
against the simple alternative hypothesis $H_1: \theta = \theta_1$.

4 This content is available online at <http://cnx.org/content/m13528/1.2/>.

4.4.1.3

For a realistic application of the Neyman-Pearson lemma, consider the following, in which the test is based 
on a random sample from a normal distribution. 

Example 4.6 

Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution $N(\mu, 36)$. We shall find the
best critical region for testing the simple hypothesis $H_0: \mu = 50$ against the simple alternative
hypothesis $H_1: \mu = 55$. Using the ratio of the likelihood functions, namely L(50)/L(55), we shall
find those points in the sample space for which this ratio is less than or equal to some constant k.
That is, we shall solve the following inequality:

$$\frac{L(50)}{L(55)} = \frac{(72\pi)^{-n/2} \exp\left[ -\frac{1}{72} \sum_{i=1}^{n} (x_i - 50)^2 \right]}{(72\pi)^{-n/2} \exp\left[ -\frac{1}{72} \sum_{i=1}^{n} (x_i - 55)^2 \right]} = \exp\left[ -\frac{1}{72} \left( 10 \sum_{i=1}^{n} x_i + n50^2 - n55^2 \right) \right] \le k.$$

If we take the natural logarithm of each member of the inequality, we find that

$$-10 \sum_{i=1}^{n} x_i - n50^2 + n55^2 \le (72) \ln k.$$

Thus,

$$\frac{1}{n} \sum_{i=1}^{n} x_i \ge -\frac{1}{10n} \left[ n50^2 - n55^2 + (72) \ln k \right],$$

or equivalently, $\bar x \ge c$, where $c = -\frac{1}{10n} \left[ n50^2 - n55^2 + (72) \ln k \right]$.
Thus $L(50)/L(55) \le k$ is equivalent to $\bar x \ge c$.
A best critical region is, according to the Neyman-Pearson lemma,

$$C = \{ (x_1, x_2, \ldots, x_n) : \bar x \ge c \},$$

where c is selected so that the size of the critical region is $\alpha$. Say n=16 and c=53. Since $\bar X$ is
N(50, 36/16) under $H_0$, we have

$$\alpha = P(\bar X \ge 53;\ \mu = 50) = P\left( \frac{\bar X - 50}{6/4} \ge \frac{3}{6/4};\ \mu = 50 \right) = 1 - \Phi(2) = 0.0228.$$
The example 1 illustrates what is often true, namely, that the inequality

$$\frac{L(\theta_0)}{L(\theta_1)} \le k$$

can be expressed in terms of a function $u(x_1, x_2, \ldots, x_n)$, say,

$$u(x_1, x_2, \ldots, x_n) \le c_1$$

or

$$u(x_1, x_2, \ldots, x_n) \ge c_2,$$

where $c_1$ or $c_2$ is selected so that the size of the critical region is $\alpha$. Thus the test can be based on the
statistic $u(X_1, \ldots, X_n)$. Also, for illustration, if we want $\alpha$ to be a given value, say 0.05, we would then
choose our $c_1$ or $c_2$ accordingly. In example 1, with $\alpha$=0.05, we want

$$0.05 = P(\bar X \ge c;\ \mu = 50) = 1 - \Phi\left( \frac{c - 50}{3/2} \right).$$

Hence it must be true that $(c - 50)/(3/2) = 1.645$, or equivalently, $c = 50 + \frac{3}{2}(1.645) \approx 52.47$.

Example 4.7

Let $X_1, X_2, \ldots, X_n$ denote a random sample of size n from a Poisson distribution with mean $\lambda$. A
best critical region for testing $H_0: \lambda = 2$ against $H_1: \lambda = 5$ is given by

$$\frac{L(2)}{L(5)} = \frac{2^{\sum x_i} e^{-2n}}{x_1!\,x_2! \cdots x_n!} \cdot \frac{x_1!\,x_2! \cdots x_n!}{5^{\sum x_i} e^{-5n}} \le k.$$

The inequality is equivalent to $\left( \frac{2}{5} \right)^{\sum x_i} e^{3n} \le k$ and $\left( \sum x_i \right) \ln\left( \frac{2}{5} \right) + 3n \le \ln k$.
Since $\ln(2/5) < 0$, this is the same as

$$\sum_{i=1}^{n} x_i \ge \frac{\ln k - 3n}{\ln(2/5)} = c.$$

If n=4 and c=13, then

$$\alpha = P\left( \sum_{i=1}^{4} X_i \ge 13;\ \lambda = 2 \right) = 1 - 0.936 = 0.064,$$

from the tables, since $\sum_{i=1}^{4} X_i$ has a Poisson distribution with mean 8 when $\lambda$=2.
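
The sizes of the two best critical regions above can be recomputed without tables. The sketch below covers the normal case of example 1 (n = 16, $\sigma$ = 6) and the Poisson case of Example 4.7 (sum of four observations, mean 8 under $H_0$); the helper name `poisson_cdf` is an illustrative choice.

```python
import math
from statistics import NormalDist

# Normal case: X_bar is N(50, 36/16) under H0, critical region x_bar >= c.
x_bar_under_h0 = NormalDist(mu=50, sigma=6 / math.sqrt(16))
alpha_for_c_53 = 1 - x_bar_under_h0.cdf(53)     # size when c = 53
c_for_alpha_05 = x_bar_under_h0.inv_cdf(0.95)   # c that gives size 0.05


def poisson_cdf(k, lam):
    """P(Y <= k) for Y ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam**y / math.factorial(y) for y in range(k + 1))


# Poisson case: the sum of n = 4 observations has mean 4 * 2 = 8 under H0.
alpha_poisson = 1 - poisson_cdf(12, 8)          # P(sum >= 13)

print(f"normal:  alpha(c=53) = {alpha_for_c_53:.4f}, c(alpha=0.05) = {c_for_alpha_05:.2f}")
print(f"Poisson: alpha(c=13) = {alpha_poisson:.3f}")
```

In the discrete Poisson case only certain sizes are attainable, which is why the example fixes c = 13 and reports the resulting $\alpha$ rather than solving for a prescribed level.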
When $H_0: \theta = \theta_0$ and $H_1: \theta = \theta_1$ are both simple hypotheses, a critical region of size $\alpha$ is a best critical
region if the probability of rejecting $H_0$ when $H_1$ is true is a maximum when compared with all other critical
regions of size $\alpha$. The test using the best critical region is called a most powerful test because it has the
greatest value of the power function at $\theta = \theta_1$ when compared with that of other tests of significance level
$\alpha$. If $H_1$ is a composite hypothesis, the power of a test depends on each simple alternative in $H_1$.

4.4.1.4.1 

Definition 4.2: 

A test, defined by a critical region C of size $\alpha$, is a uniformly most powerful test if it is a most
powerful test against each simple alternative in $H_1$. The critical region C is called a uniformly
most powerful critical region of size $\alpha$.

Let us now consider an example in which the alternative is composite.

Example 4.8 

Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\mu, 36)$. We have seen that when testing $H_0: \mu = 50$
against $H_1: \mu = 55$, a best critical region C is defined by

$$C = \{ (x_1, x_2, \ldots, x_n) : \bar x \ge c \},$$

where c is selected so that the significance level is $\alpha$. Now consider testing $H_0: \mu = 50$ against
the one-sided composite alternative hypothesis $H_1: \mu > 50$. For each simple hypothesis in $H_1$, say
$\mu = \mu_1$, the quotient of the likelihood functions is

$$\frac{L(50)}{L(\mu_1)} = \frac{(72\pi)^{-n/2} \exp\left[ -\frac{1}{72} \sum_{i=1}^{n} (x_i - 50)^2 \right]}{(72\pi)^{-n/2} \exp\left[ -\frac{1}{72} \sum_{i=1}^{n} (x_i - \mu_1)^2 \right]} = \exp\left[ -\frac{1}{72} \left\{ 2(\mu_1 - 50) \sum_{i=1}^{n} x_i + n(50^2 - \mu_1^2) \right\} \right].$$

Now $L(50)/L(\mu_1) \le k$ if and only if

$$\bar x \ge \frac{(-72) \ln(k)}{2n(\mu_1 - 50)} + \frac{50 + \mu_1}{2} = c.$$

Thus the best critical region of size $\alpha$ for testing $H_0: \mu = 50$ against $H_1: \mu = \mu_1$, where
$\mu_1 > 50$, is given by

$$C = \{ (x_1, x_2, \ldots, x_n) : \bar x \ge c \},$$

where c is selected such that

$$P(\bar X \ge c;\ H_0: \mu = 50) = \alpha.$$


4.4.2 



note: the same value of c can be used for each $\mu_1 > 50$, but of course k does not remain the
same. Since the critical region C defines a test that is most powerful against each simple alternative
$\mu_1 > 50$, this is a uniformly most powerful test, and C is a uniformly most powerful critical region
of size $\alpha$. Again if $\alpha$=0.05, then $c \approx 52.47$.



4.5 HYPOTHESES TESTING 5 

4.5.1 Hypotheses Testing - Examples. 

Example 4.9 

We have tossed a coin 50 times and we got k = 19 heads. Should we accept or reject the hypothesis
that p = 0.5 (that is, that the coin is fair)?

Null versus Alternative Hypothesis:

• Null hypothesis ($H_0$): p = 0.5.

• Alternative hypothesis ($H_1$): $p \ne 0.5$.

5 This content is available online at <http://cnx.org/content/m13533/1.2/>.

Figure 4.3

Significance level $\alpha$ = Probability of Type I error = Pr[rejecting $H_0$ | $H_0$ true].
P[k < 18 or k > 32] < 0.05.

If k < 18 or k > 32, then under the null hypothesis the observed event falls into the rejection
region with probability $\alpha$ < 0.05.

note: We want $\alpha$ as small as possible.

Figure 4.4: (a) Test construction, (b) Cumulative distribution function.

note: No evidence to reject the null hypothesis.



4.5.1.1 



Example 4.10 

We have tossed a coin 50 times and we got k = 10 heads. Should we accept /reject the hypothesis 
that p = 0.5, provided taht the coin is fair? 
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Figure 4.5: Cumulative distribution function. 



$P[k \le 10 \text{ or } k \ge 40] \approx 0.000025$. We could reject hypothesis $H_0$ at a significance level as low
as $\alpha = 0.000025$.



note: p-value is the lowest attainable significance level. 
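A two-sided p-value of this kind can be computed exactly. The sketch below assumes the usual convention of counting every outcome at least as far from n/2 as the observed k; the function name is illustrative.

```python
from math import comb

def two_sided_p_value(k, n):
    """P-value for H0: p = 0.5, counting outcomes at least as far from n/2 as k."""
    distance = abs(k - n / 2)
    total = sum(comb(n, j) for j in range(n + 1) if abs(j - n / 2) >= distance)
    return total / 2**n

p10 = two_sided_p_value(10, 50)   # Example 4.10: very small, of order 1e-5
p19 = two_sided_p_value(19, 50)   # Example 4.9: well above 0.05
print(p10, p19)
```

The first value is of the same order as the figure quoted in Example 4.10, while the second shows why k = 19 gives no evidence against the null hypothesis.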



note: In STATISTICS, to prove something means to reject the hypothesis that the converse is true.



4.5.1.2 



Example 4.11 

We know that, on average, a mouse tail is 5 cm long. We have a group of 10 mice and give each
of them a dose of vitamin X every day, from birth, for a period of 6 months.

We want to prove that vitamin X makes mouse tails longer. We measure the tail lengths of our
group and get the following sample:

Table 1

5.5   5.6   4.3   5.1   5.2   6.1   5.0   5.2   5.8   4.1

Table 4.5



• Hypothesis $H_0$ - sample = sample from normal distribution with $\mu = 5$ cm.

• Alternative $H_1$ - sample = sample from normal distribution with $\mu > 5$ cm.



CONSTRUCTION OF THE TEST

Figure 4.6: Construction of the test, with the "reject" and "cannot reject" regions marked.



We do not know population variance, and/or we suspect that vitamin treatment may change the 
variance - so we use t distribution (Section 2.5.1: THE t DISTRIBUTION). 



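• $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$,

• $t = \frac{\bar{X} - \mu}{s}\sqrt{N}$, where $s^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2$.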
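The t statistic for the mouse-tail sample can be evaluated as follows. This is a sketch under stated assumptions: the test is taken to be one-sided at $\alpha = 0.05$, and the critical value 1.833 is the tabulated upper 5% point of the t distribution with 9 degrees of freedom, neither of which is stated explicitly in the example.

```python
from math import sqrt
from statistics import mean, stdev

tails = [5.5, 5.6, 4.3, 5.1, 5.2, 6.1, 5.0, 5.2, 5.8, 4.1]  # sample from Table 1, in cm
mu0 = 5.0                       # mean tail length under the null hypothesis
n = len(tails)
x_bar = mean(tails)
s = stdev(tails)                # sample standard deviation (divisor n - 1)
t = (x_bar - mu0) * sqrt(n) / s

T_CRIT = 1.833                  # assumed: tabulated t_{0.05}(9), one-sided test
reject = t > T_CRIT
print(round(t, 3), reject)
```

Under these assumptions the statistic stays below the critical value, so the sample does not allow us to reject the null hypothesis.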



Example 4.12 
$\chi^2$ test (K. Pearson, 1900)

The purpose is to test the hypothesis that given data actually come from a population with the
proposed distribution. The data are given in Table 2 (Table 4.6: DATA).
DATA

0.4319  0.6874  0.5301  0.8774  0.6698  1.1900  0.4360  0.2192  0.5082
0.3564  1.2521  0.7744  0.1954  0.3075  0.6193  0.4527  0.1843  2.2617
0.4048  2.3923  0.7029  0.9500  0.1074  3.3593  0.2112  0.0237  0.0080
0.1897  0.6592  0.5572  1.2336  0.3527  0.9115  0.0326  0.2555  0.7095
0.2360  1.0536  0.6569  0.0552  0.3046  1.2388  0.1402  0.3712  1.6093
1.2595  0.3991  0.3698  0.7944  0.4425  0.6363  2.5008  2.8841  0.9300
3.4827  0.7658  0.3049  1.9015  2.6742  0.3923  0.3974  3.3202  3.2906
1.3283  0.4263  2.2836  0.8007  0.3678  0.2654  0.2938  1.9808  0.6311
0.6535  0.8325  1.4987  0.3137  0.2862  0.2545  0.5899  0.4713  1.6893
0.6375  0.2674  0.0907  1.0383  1.0939  0.1155  1.1676  0.1737  0.0769
1.1692  1.1440  2.4005  2.0369  0.3560  1.3249  0.1358  1.3994  1.4138
0.0046  -       -       -       -       -       -       -       -

Table 4.6



Problem

Are these data sampled from a population with an exponential p.d.f.?

Solution

$f(x) = e^{-x}$.
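One way to carry out the $\chi^2$ test for these data is sketched below. The choice of K = 5 equiprobable cells is an assumption made for illustration (the text does not give the cells used), and 9.488 is the tabulated 0.95 quantile of the chi-square distribution with K - 1 = 4 degrees of freedom, appropriate here because no parameter is estimated.

```python
from math import log

DATA = [
    0.4319, 0.6874, 0.5301, 0.8774, 0.6698, 1.1900, 0.4360, 0.2192, 0.5082,
    0.3564, 1.2521, 0.7744, 0.1954, 0.3075, 0.6193, 0.4527, 0.1843, 2.2617,
    0.4048, 2.3923, 0.7029, 0.9500, 0.1074, 3.3593, 0.2112, 0.0237, 0.0080,
    0.1897, 0.6592, 0.5572, 1.2336, 0.3527, 0.9115, 0.0326, 0.2555, 0.7095,
    0.2360, 1.0536, 0.6569, 0.0552, 0.3046, 1.2388, 0.1402, 0.3712, 1.6093,
    1.2595, 0.3991, 0.3698, 0.7944, 0.4425, 0.6363, 2.5008, 2.8841, 0.9300,
    3.4827, 0.7658, 0.3049, 1.9015, 2.6742, 0.3923, 0.3974, 3.3202, 3.2906,
    1.3283, 0.4263, 2.2836, 0.8007, 0.3678, 0.2654, 0.2938, 1.9808, 0.6311,
    0.6535, 0.8325, 1.4987, 0.3137, 0.2862, 0.2545, 0.5899, 0.4713, 1.6893,
    0.6375, 0.2674, 0.0907, 1.0383, 1.0939, 0.1155, 1.1676, 0.1737, 0.0769,
    1.1692, 1.1440, 2.4005, 2.0369, 0.3560, 1.3249, 0.1358, 1.3994, 1.4138,
    0.0046,
]

def chi_square_exponential(data, k=5, rate=1.0):
    """Pearson statistic for k equiprobable cells under F(x) = 1 - exp(-rate*x)."""
    # Interior cell boundaries are the j/k quantiles of the exponential distribution.
    edges = [-log(1.0 - j / k) / rate for j in range(1, k)]
    counts = [0] * k
    for x in data:
        cell = sum(x >= e for e in edges)    # number of boundaries at or below x
        counts[cell] += 1
    expected = len(data) / k
    stat = sum((o - expected) ** 2 / expected for o in counts)
    return stat, counts

stat, counts = chi_square_exponential(DATA)
CHI2_CRIT = 9.488    # assumed: tabulated chi-square 0.95 quantile, 4 d.f.
reject = stat > CHI2_CRIT
print(counts, stat, reject)
```

With these cells the statistic falls below the critical value, so the hypothesis of an exponential p.d.f. with unit rate cannot be rejected at the 5% level.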
CONSTRUCTION OF THE TEST

Figure 4.7: Construction of the test; panel (b) marks the "cannot reject" region.



4.5.1.3 



Exercise 4.1 

Are these data sampled from a population with an exponential p.d.f.?



(Solution on p. 84.) 



4.5.1.4 



TABLE 1

Actual situation    Decision                      Probability
$H_0$ true          accept                        $1 - \alpha$
$H_0$ true          reject = error of type I      $\alpha$ = significance level
$H_0$ false         reject                        $1 - \beta$ = power of the test
$H_0$ false         accept = error of type II     $\beta$

Table 4.7



4.5.2 
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Solutions to Exercises in Chapter 4 

Solution to Exercise 4.1 (p. 82) 

$f(x) = a e^{-ax}$.

1. Estimate a.

2. Use the $\chi^2$ test.

3. Remember d.f. = K - 2.



Chapter 5 

Pseudo - Numbers 

5.1 PSEUDO-NUMBERS 1 

5.1.1 UNIFORM PSEUDO-RANDOM VARIABLE GENERATION 

In this section, our goal is to look in more detail at how, and whether, particular types of pseudo-random
variable generators work, and at how, if necessary, we can implement a generator of our own choosing.
The requirements for our uniform random variable generator are listed below:

1. A uniform marginal distribution, 

2. Independence of the uniform variables, 

3. Repeatability and portability, 

4. Computational speed. 

5.1.1.1 CURRENT ALGORITHMS 

The generation of pseudo-random variates through algorithmic methods is a mature field in the sense that
a great deal is known theoretically about different classes of algorithms, and in the sense that particular
algorithms in each of those classes have been shown, upon testing, to have good statistical properties. In
this section, we describe the main classes of generators and then make specific recommendations about
which generators should be implemented.

5.1.1.1.1 

Congruential Generators 

The most widely used and best understood class of pseudo-random number generators are those based on 
the linear congruential method introduced by Lehmer (1951). Such generators are based on the following 
formula: 

$U_i = (a U_{i-1} + c) \bmod m,$ (5.1)

where $U_i$, i = 1, 2, ..., are the output random integers; $U_0$ is the chosen starting value for the recursion, called
the seed; and a, c, and m are prechosen constants.

note: to convert to uniform (0, 1) variates, we need only divide by the modulus m; that is, we use
the sequence $\{U_i/m\}$.
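The recursion (5.1) translates almost directly into code. The following is a minimal sketch; the constants a = 5, c = 3, m = 16 and the seed 7 are toy values chosen for illustration only and are far too small for real simulation work.

```python
def lcg(seed, a, c, m):
    """Yield U_1, U_2, ... from U_i = (a*U_{i-1} + c) mod m."""
    u = seed
    while True:
        u = (a * u + c) % m
        yield u

def uniform_variates(seed, a, c, m):
    """Yield U_i / m, which always lies in [0, 1)."""
    for u in lcg(seed, a, c, m):
        yield u / m

gen = lcg(seed=7, a=5, c=3, m=16)          # hypothetical toy parameters
first = [next(gen) for _ in range(5)]      # first five random integers
unif = uniform_variates(seed=7, a=5, c=3, m=16)
first_uniform = [next(unif) for _ in range(5)]
print(first, first_uniform)
```

Because every output is an integer in 0, 1, ..., m - 1, the uniform variates can only take the m values k/m.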



1 This content is available online at <http://cnx.org/content/m13103/1.6/>.
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The following properties of the algorithm are worth stating explicitly: 

1. Because of the "mod m" operation (for background on modular operations, see Knuth, (1981) ), the 
only possible values the algorithm can produce are the integers 0, 1, 2, ..., m — 1. This follows because, 
by definition, x mod m is the remainder after x is divided by m. 

2. Because the current random integer $U_i$ depends only on the previous random integer $U_{i-1}$, once a
previous value has been repeated, the entire sequence after it must be repeated. Such a repeating
sequence is called a cycle, and its period is the cycle length. Clearly, the maximum period of
the congruential generator is m. For given choices of a, c, and m, a generator may contain many short
cycles (see Example 5.1 below), and the cycle you enter will depend on the seed you start with.
Notice that the generator with many short cycles is not a good one, since the output sequence will be 
one of a number of short series, each of which may not be uniformly distributed or randomly dispersed 
on the line or the plane. Moreover, if the simulation is long enough to cause the random numbers to 
repeat because of the short cycle length, the outputs will not be independent. 

3. If we are concerned with uniform (0, 1) variates, the finest partition of the interval (0, 1) that this
generator can provide is [0, 1/m, 2/m, ..., (m - 1)/m]. This is, of course, not truly a uniform (0, 1)
distribution since, for any k in (0, m - 1), we have P[k/m < U < (k + 1)/m] = 0, not 1/m as
required by theory for continuous random variables.

4. Choices of a,c, and m, will determine not only the fineness of the partition of (0, 1) and the cycle length, 
and therefore, the uniformity of the marginal distribution, but also the independence properties of the 
output sequence. Properly choosing a,c, and m is a science that incorporates both theoretical results 
and empirical tests. The first rule is to select the modulus m to be "as large as possible", so that there 
is some hope to address point 3 above and to generate uniform variates with an approximately uniform 
marginal distribution. However, simply having m large is not enough; one may still find that the 
generator has many short cycles, or that the sequence is not approximately independent. See example 
1 (Example 5.1) below. 

Example 5.1 

Consider

$U_i = 2U_{i-1} \bmod 2^{32},$ (5.2)

where a seed of the form $2^k$ creates a loop containing only integers that are powers of 2, or

$U_i = (U_{i-1} + 1) \bmod 2^{32},$ (5.3)

which generates the nonrandom sequence of increasing integers. Therefore, the second equation
gives a generator that has the maximum possible cycle length but is useless for simulating a random
sequence.

Fortunately, once a value of m has been selected, theoretical results exist that give conditions for choosing
values of the multiplier a and the additive constant c such that all the possible integers, 0 through m - 1,
are generated before any are repeated.

note: this does not eliminate the second counterexample above, which already has the maximal 
cycle length, but is a useless random number generator. 



5.1.1.1.1.1 

THEOREM I 

A linear congruential generator will have maximal cycle length m, if and only if: 

• c is nonzero and is relatively prime to m (i.e., c and m have no common prime factors).

• (a mod q) = 1 for each prime factor q of m.

• (a mod 4) = 1 if 4 is a factor of m.
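The three conditions of Theorem I are easy to check mechanically, and for a small modulus the result can be compared with a brute-force count of the cycle length. This is an illustrative sketch; the function names and the toy parameters are not from the original text.

```python
from math import gcd

def prime_factors(n):
    """Return the set of prime factors of n by trial division."""
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors

def satisfies_theorem_1(a, c, m):
    """Check the three conditions for maximal cycle length m."""
    if c == 0 or gcd(c, m) != 1:
        return False
    if any(a % q != 1 for q in prime_factors(m)):
        return False
    return m % 4 != 0 or a % 4 == 1

def cycle_length(a, c, m, seed=0):
    """Brute-force length of the cycle eventually entered from the seed."""
    seen, u, i = {}, seed, 0
    while u not in seen:
        seen[u] = i
        u = (a * u + c) % m
        i += 1
    return i - seen[u]

good = (satisfies_theorem_1(5, 3, 16), cycle_length(5, 3, 16))   # full period
bad = (satisfies_theorem_1(3, 3, 16), cycle_length(3, 3, 16))    # a mod 4 = 3
print(good, bad)
```

The brute-force count is only practical for small m; for realistic moduli one relies on the theorem itself.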
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PROOF 

note: Knuth (1981, p. 16). 

As a mathematical note, c is called relatively prime to m if and only if c and m have no common divisor 
other than 1, which is equivalent to c and m having no common prime factor. 

A related result concerns the case of c chosen to be 0. This case does not conform to the conditions of
Theorem I (p. 86). Moreover, a value $U_i$ of zero must be avoided because the generator will continue to produce zero
after the first occurrence of a zero. In particular, a seed of zero is not allowable. By Theorem I (p. 86),
a generator with c = 0, which is called a multiplicative congruential generator, cannot have maximal
cycle length m. However, by Theorem II (p. 87), it can have cycle length m - 1.

THEOREM II 

If c = 0 in a linear congruential generator, then $U_i = 0$ can never be included in a cycle, since the 0 will
always repeat. However, the generator will cycle through all m - 1 integers in the set {1, 2, ..., m - 1} if and only
if:

• m is a prime integer and

• a is a primitive element modulo m.
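For a small prime modulus, Theorem II can be illustrated by counting the cycle length directly. The sketch below assumes m is prime and a is not a multiple of m, so that the recursion permutes the nonzero residues and the loop is guaranteed to return to the seed; m = 31 is a toy value chosen for illustration.

```python
def multiplicative_cycle_length(a, m, seed=1):
    """Period of U_i = a*U_{i-1} mod m from a nonzero seed (m prime, a % m != 0)."""
    if a % m == 0 or seed % m == 0:
        raise ValueError("a and the seed must be nonzero modulo m")
    u, length = (a * seed) % m, 1
    while u != seed:
        u = (a * u) % m
        length += 1
    return length

full = multiplicative_cycle_length(3, 31)    # 3 is a primitive element modulo 31
short = multiplicative_cycle_length(2, 31)   # 2 is not: 2**5 = 32 = 1 (mod 31)
print(full, short)
```

Only the primitive element visits all m - 1 nonzero integers before repeating.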

PROOF 

note: Knuth (1981, p. 19). 

A formal definition of primitive elements modulo m, as well as theoretical results for finding them, are given
in Knuth (1981). In effect, when m is a prime, a is a primitive element if the cycle is of length m - 1. The
results of Theorem II (p. 87) are not intuitively useful, but for our purposes, it is enough to note that such
primitive elements exist and have been computed by researchers,

note: e.g., Table 24.8 in Abramowitz and Stegun, 1965.

Hence, we now must select one of two possibilities: 

• Choose a, c, and m according to Theorem I (p. 86) and work with a generator whose cycle length is 
known to be m. 

• Choose c = 0, take a and m according to Theorem II (p. 87), use a number other than zero as the 
seed, and work with a generator whose cycle length is known to be m — 1. A generator satisfying these 
conditions is known as a prime-modulus multiplicative congruential generator and, because of 
the simpler computation, it usually has an advantage in terms of speed over the mixed congruential 
generator. 

Another method frequently used for speeding up a random number generator that has c = 0 is to choose the
modulus m to be computationally convenient. For instance, consider $m = 2^k$. This is clearly not a prime
number, but on a computer the modulus operation becomes a bit-shift operation in machine code. In such
cases, Theorem III gives a guide to the maximal cycle length.

THEOREM III 

If c = 0 and $m = 2^k$ with k > 2, then the maximal possible cycle length is $2^{k-2}$. This is achieved if and
only if two conditions hold:

• a is a primitive element modulo m.

• the seed is odd.
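A brute-force check with a small power-of-two modulus illustrates Theorem III. In this sketch the multiplier a = 5 and the exponent k = 6 are assumptions chosen for illustration; a = 5 has multiplicative order $2^{k-2}$ modulo $2^k$, which is the property the theorem requires.

```python
def cycle_length(a, m, seed):
    """Brute-force length of the cycle of U_i = a*U_{i-1} mod m entered from the seed."""
    seen, u, i = {}, seed, 0
    while u not in seen:
        seen[u] = i
        u = (a * u) % m
        i += 1
    return i - seen[u]

k = 6
m = 2 ** k
odd_seed_period = cycle_length(5, m, seed=1)    # odd seed: maximal length 2**(k-2)
even_seed_period = cycle_length(5, m, seed=2)   # even seed: shorter cycle
print(odd_seed_period, even_seed_period)
```

Even in the best case only a quarter of the m possible integers are visited, which is the sacrifice in cycle length made for the faster modulus operation.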

PROOF 

note: Knuth (1981, p. 19). 
Notice that we sacrifice some of the cycle length and, as we will see in Theorem IV, we also lose some
randomness in the low-order bits of the random variates. Having used any of Theorems I (p. 86), II (p. 87),
or III (p. 87) to select triples (a, c, m) that lead to generators with sufficiently long cycles of known length,
we can ask which triple gives the most random (i.e., approximately independent) sequence. Although some
theoretical results exist for generators as a whole, these are generally too weak to eliminate any but the
worst generators. Marsaglia (1985) and Knuth (1981, Chap. 3.3.3) are good sources for material on
these results.

THEOREM IV 

If $U_i = aU_{i-1} \bmod 2^k$, and we define

$Y_i = U_i \bmod 2^j, \quad 0 < j < k,$ (5.4)

then

$Y_i = aY_{i-1} \bmod 2^j.$ (5.5)

In practical terms, this means that the sequence of j low-order binary bits of the U sequence, namely $Y_i$, cycles
with cycle length at most $2^j$. In particular, the sequence of the least significant bit (i.e., j = 1) in $(U_1, U_2, U_3, \ldots)$
must behave as (0, 0, 0, 0, ...), (1, 1, 1, 1, ...), (0, 1, 0, 1, ...) or (1, 0, 1, 0, ...).
PROOF 

note: Knuth (1981, pp. 12-14). 
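The effect described in Theorem IV is easy to observe. The following sketch extracts the j low-order bits of a multiplicative generator with modulus $2^{32}$; the multiplier 5 and the seed 12345 are arbitrary illustrative choices.

```python
def low_order_bits(a, k, j, seed, count):
    """Return Y_i = U_i mod 2**j for U_i = a*U_{i-1} mod 2**k, i = 1..count."""
    u, out = seed, []
    for _ in range(count):
        u = (a * u) % (2 ** k)
        out.append(u % (2 ** j))
    return out

lsb = low_order_bits(a=5, k=32, j=1, seed=12345, count=8)    # least significant bit
low3 = low_order_bits(a=5, k=32, j=3, seed=12345, count=8)   # three low-order bits
print(lsb, low3)
```

With an odd seed and an odd multiplier every output is odd, so the least significant bit is constant, and the three low-order bits repeat with a period far shorter than the bound $2^3$.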

Such nonrandom behavior in the low-order bits of a congruential generator with non-prime modulus m is an
undesirable property, which may be aggravated by techniques such as the recycling of uniform variates. It
has been observed (Hutchinson, 1966) that prime-modulus multiplicative congruential generators with
full cycle (i.e., when a is a primitive element modulo m) tend to have fairly randomly distributed low-order
bits, although no theory exists to explain this.

THEOREM V 

If our congruential generator produces the sequence $(U_1, U_2, \ldots)$, and we look at the following sequence
of points in n dimensions:

$(U_1, U_2, U_3, \ldots, U_n),\ (U_2, U_3, U_4, \ldots, U_{n+1}),\ (U_3, U_4, U_5, \ldots, U_{n+2}),\ \ldots$ (5.6)

then the points will all lie in fewer than $(n!\,m)^{1/n}$ parallel hyperplanes.
PROOF 

NOTE: Marsaglia (1976). 

Given these known limitations of congruential generators, we are still left with the question of how to choose
the "best" values for a, c, and m. To do this, researchers have followed a straightforward but time-consuming
procedure:

1. Take values a, c, and m that give a sufficiently long, known cycle length and use the generator to
produce sequences of uniform variates.

2. Subject the output sequences to batteries of statistical tests for independence and a uniform marginal 
distribution. Document the results. 

3. Subject the generator to theoretical tests. In particular, the spectral test of Coveyou and MacPherson
(1967) is currently widely used and recognized as a very sensitive structural test for distinguishing
between good and bad generators. Document the results.

4. As new, more sensitive tests appear, subject the generator to those tests. Several such tests are discussed
in Marsaglia (1985).
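Step 2 of the procedure can be illustrated with the simplest empirical test, a $\chi^2$ test of the uniform marginal distribution. This is only a sketch of one test, not a full battery. The multiplier 16807 with modulus $2^{31} - 1$ is a well-known prime-modulus multiplicative generator used here as an assumed example; it is not named in the text. The second generator is the useless counter of equation (5.3).

```python
def lcg_uniforms(seed, a, c, m, count):
    """Return count uniform variates U_i / m from a congruential generator."""
    u, out = seed, []
    for _ in range(count):
        u = (a * u + c) % m
        out.append(u / m)
    return out

def chi_square_uniform(sample, cells=10):
    """Pearson statistic for equal-width cells on [0, 1), with the cell counts."""
    counts = [0] * cells
    for x in sample:
        counts[min(int(x * cells), cells - 1)] += 1
    expected = len(sample) / cells
    return sum((o - expected) ** 2 / expected for o in counts), counts

# Assumed example parameters: a = 16807, m = 2**31 - 1, c = 0.
sample = lcg_uniforms(seed=1, a=16807, c=0, m=2**31 - 1, count=10000)
stat, counts = chi_square_uniform(sample)

# The counter generator (5.3): every variate falls into the first cell.
bad_sample = lcg_uniforms(seed=0, a=1, c=1, m=2**32, count=10000)
bad_stat, bad_counts = chi_square_uniform(bad_sample)
print(stat, bad_stat)
```

The statistic would be compared with a tabulated chi-square quantile for 9 degrees of freedom (for example 16.919 at the 0.95 level). A marginal test of this kind says nothing about independence, which needs separate tests.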

note: Other Types of Generators 
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5.2 PSEUDO-RANDOM VARIABLE GENERATORS, cont. 2 
5.2.1 PSEUDO-RANDOM VARIABLE GENERATORS, cont. 

5.2.1.1 A Shift-Register Generator 

An alternative class of pseudo-random number generators is the class of shift-register or Tausworthe generators, which
have their origins in the work of Golomb (1967). These algorithms operate on n-bit, pseudo-random binary
vectors, just as congruential generators (p. 85) operate on pseudo-random integers. To return a uniform
(0, 1) variate, the binary vector must be converted to an integer and divided by $2^n$, which is one plus the
largest possible number.
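A shift-register recursion on 32-bit vectors can be sketched as follows. The shift amounts 15 and 17 are those quoted later in this chapter for the Tausworthe component of Super-Duper; the order in which the two shifts are applied, and the function name, are assumptions of this sketch rather than a documented specification.

```python
MASK32 = 0xFFFFFFFF   # keep the state to n = 32 bits

def shift_register(seed, count, right=15, left=17):
    """Return uniform variates from a 32-bit XOR shift-register recursion."""
    y = seed & MASK32
    if y == 0:
        raise ValueError("seed must be a nonzero 32-bit integer")
    out = []
    for _ in range(count):
        y ^= y >> right               # XOR the vector with a right-shifted copy
        y ^= (y << left) & MASK32     # XOR with a left-shifted copy, truncated to 32 bits
        out.append(y / 2**32)         # integer divided by one plus the largest value
    return out

variates = shift_register(seed=123456789, count=1000)
print(variates[:3])
```

Both XOR steps are invertible on 32-bit vectors, so a nonzero seed never leads to the all-zero vector.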

5.2.1.2 Fibonacci Generators 

The final major class of generators to be considered are the lagged Fibonacci generators, which take
their name from the famous Fibonacci sequence $U_i = U_{i-1} + U_{i-2}$. This recursion is reminiscent of the
congruential generators, with the added feature that the current value depends on the two previous values.
The integer generator based directly on the Fibonacci formula

$U_i = (U_{i-1} + U_{i-2}) \bmod 2^n$ (5.7)

has been investigated, but not found to be satisfactorily random. A more general formulation can be given
by the equation:

$U_i = U_{i-r} \,\square\, U_{i-s}, \quad r \ge 1,\ s \ge 1,\ r \ne s,$ (5.8)

where the symbol $\square$ represents an arbitrary mathematical operation. We can think of the $U_i$'s as
either binary vectors, integers, or real numbers between 0 and 1, depending on the operation involved.

As examples:

1. The $U_i$'s are real and $\square$ represents either mod 1 addition or subtraction.

2. The $U_i$'s are (n - 1)-bit integers and $\square$ represents either mod $2^n$ addition, subtraction, or
multiplication.

3. The $U_i$'s are binary vectors and $\square$ represents any of binary addition, binary subtraction,
exclusive-or addition, or multiplication.
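The first of these examples, real values combined by mod 1 addition, can be sketched as below. The lags r = 2 and s = 5 and the seed values are hypothetical and chosen only to keep the example small; practical generators use much longer lags.

```python
from collections import deque

def lagged_fibonacci(seed_values, r, s, count):
    """Return U_i = (U_{i-r} + U_{i-s}) mod 1 for real U in [0, 1)."""
    lag = max(r, s)
    if r == s or min(r, s) < 1 or len(seed_values) != lag:
        raise ValueError("need r != s, r, s >= 1, and max(r, s) seed values")
    history = deque(seed_values, maxlen=lag)    # history[-1] is U_{i-1}
    out = []
    for _ in range(count):
        u = (history[-r] + history[-s]) % 1.0
        history.append(u)                       # the oldest value drops out
        out.append(u)
    return out

# Hypothetical seed table; in practice it would come from another generator.
values = lagged_fibonacci([0.1, 0.2, 0.3, 0.4, 0.5], r=2, s=5, count=100)
print(values[:2])
```

The generator needs max(r, s) starting values, which is one reason lagged Fibonacci generators are usually seeded from a congruential generator.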

Other generators that generalize even further on the Fibonacci idea by using a linear combination of previous 
random integers to generate the current random integer are discussed in Knuth (1981, Chap 3.2.2). 

5.2.1.3 Combinations of Generators (Shuffling) 

Intuitively, it is tempting to believe that "combining" two sequences of pseudo-random variables will produce 
one sequence with better uniformity and randomness properties than either of the two originals. In fact, even 
though good congruential (p. 85), Tausworthe (Section 5.2.1.1: A Shift-Register Generator), and Fibonacci 
(Section 5.2.1.2: Fibonacci Generators) generators exist, combination generators may be better for a number 
of reasons. The individual generators with short cycle length can be combined into one with a very long cycle.
This can be a great advantage, especially on computers with limited mathematical precision. These potential 
advantages have led to the development of a number of successful combination generators and research into 
many others. 

One such generator is a combination of three congruential generators, developed and tested by Wichmann
and Hill (1982).
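A sketch of this combination is given below. The multipliers and moduli are those of the published Wichmann and Hill algorithm as commonly quoted; they do not appear in this text and should be checked against the original reference before serious use. The seeds are arbitrary illustrative values.

```python
def wichmann_hill(x, y, z, count):
    """Return uniform variates from three combined multiplicative generators."""
    out = []
    for _ in range(count):
        x = (171 * x) % 30269      # first prime-modulus generator
        y = (172 * y) % 30307      # second
        z = (170 * z) % 30323      # third
        # Sum of the three uniform variates, reduced mod 1.
        out.append((x / 30269 + y / 30307 + z / 30323) % 1.0)
    return out

variates = wichmann_hill(x=1, y=2, z=3, count=1000)   # hypothetical nonzero seeds
print(variates[:3])
```

Each component uses only small integers, which is what makes the combination attractive on machines with limited precision.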

Another generator, Super-Duper, developed by G. Marsaglia, combines the binary form of the output
from the multiplicative congruential generator with a multiplier a = 69,069 and modulus $m = 2^{32}$ with the



2 This content is available online at <http://cnx.org/content/m13104/1.4/>.
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output of the 32-bit Tausworthe generator using a left-shift of 17 and a right shift of 15. This generator 
performs well, though not perfectly, and suffers from some practical drawbacks. 

A third general variation, a shuffled generator, randomizes the order in which a generator's variates are
output. Specifically, we consider one pseudo-random variate generator that produces the sequence $(U_1, U_2, \ldots)$
of uniform (0, 1) variates, and a second generator that outputs random integers, say between 1 and 16.

The algorithm for the combined, shuffled generator is as follows: 

1. Set up a "table" in memory of locations 1 through 16 and store the values $U_1, U_2, \ldots, U_{16}$ sequentially
in the table.

2. Generate one value, V, between 1 and 16 from the second generator. 

3. Return the U variate from location V in the table as the desired output pseudo-random variate. 

4. Generate a new U variate and store it in the location V that was just accessed. 

5. If more random variates are desired, return to Step 2. 

note: the size of the table can be any value, with larger tables creating more randomness but
requiring more memory allocation.
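The five steps above can be sketched as a Python generator. The two sources are passed in as functions, so any pair of generators can be combined; the use of Python's `random.Random` for both sources here is purely a convenience for the example, not part of the method.

```python
import random

def shuffled_generator(next_u, next_index, table_size=16):
    """Yield variates from next_u() in an order randomized by next_index()."""
    table = [next_u() for _ in range(table_size)]    # step 1: fill the table
    while True:
        v = next_index()                             # step 2: V in 1..table_size
        out = table[v - 1]                           # step 3: variate at location V
        table[v - 1] = next_u()                      # step 4: refill location V
        yield out                                    # step 5: repeat on demand

rng_u = random.Random(1)      # stand-in for the uniform variate generator
rng_v = random.Random(2)      # stand-in for the integer generator
gen = shuffled_generator(rng_u.random, lambda: rng_v.randint(1, 16))
sample = [next(gen) for _ in range(5)]
print(sample)
```

Every value produced by the first generator is eventually eligible for output; only the order in which the values appear is changed.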

This method of shuffling by randomly accessing and filling a table is due to MacLaren and Marsaglia
(1965). Another scheme, attributed to M. Gentleman in Andrews et al. (1972), is to permute the table
of 128 random numbers before returning them for use. The use of this type of combination of generators has
also been described in the context of simulation problems in physics by Binder and Stauffer (1984).

5.3 THE INVERSE PROBABILITY METHOD FOR GENERATING RANDOM VARIABLES 3

5.3.1 THE INVERSE PROBABILITY METHOD FOR GENERATING RANDOM VARIABLES

Once the generation of the uniform random variable (Section 5.1.1: UNIFORM PSEUDO-RANDOM VARIABLE
GENERATION) is established, it can be used to generate other types of random variables.

5.3.1.1 The Continuous Case 

THEOREM I 

Let X have a continuous distribution $F_X(x)$, so that $F_X^{-1}(\alpha)$ exists for $0 < \alpha < 1$ (and is hopefully
computable). Then the random variable $F_X^{-1}(U)$ has distribution $F_X(x)$ if U is uniformly distributed on (0, 1).
PROOF

$P\left(F_X^{-1}(U) \le x\right) = P\left(F_X\left(F_X^{-1}(U)\right) \le F_X(x)\right),$ (5.9)

because $F_X(x)$ is monotone. Thus,

$P\left(F_X^{-1}(U) \le x\right) = P\left(U \le F_X(x)\right) = F_X(x).$ (5.10)

The last step follows because U is uniformly distributed on (0, 1). Diagrammatically, we have that $(X \le x)$
if and only if $[U \le F_X(x)]$, an event of probability $F_X(x)$.

As long as we can invert the distribution function $F_X(x)$ to get the inverse distribution function $F_X^{-1}(\alpha)$,
the theorem assures us we can start with a pseudo-random uniform variable U and turn it into a random
variable $F_X^{-1}(U)$, which has the required distribution $F_X(x)$.

Example 5.2 
The Exponential Distribution 



3 This content is available online at <http://cnx.org/content/m13113/1.3/>.
Consider the exponential distribution defined as

$\alpha = F_X(x) = \begin{cases} 1 - e^{-\lambda x}, & \lambda > 0,\ x \ge 0, \\ 0, & x < 0. \end{cases}$ (5.11)

Then for the inverse distribution function we have

$x = -\frac{1}{\lambda}\ln(1 - \alpha) = F_X^{-1}(\alpha).$ (5.12)

Thus if U is uniformly distributed on 0 to 1, then $X = -\frac{1}{\lambda}\ln(1 - U)$ has the distribution of an
exponential random variable with parameter $\lambda$. We say, for convenience, that X is exponential ($\lambda$).

note: If U is uniform (0, 1), then so is (1 - U), and the pair U and (1 - U) are interchangeable in
terms of distribution. Hence, $X' = -\frac{1}{\lambda}\ln(U)$ is exponential. However, the two variables X and X'
are correlated and are known as an antithetic pair.
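The inverse transform for the exponential distribution, together with its antithetic partner, can be sketched as follows. Python's `random.Random` stands in for the uniform generator; the rate 2.0, the seed, and the sample size are arbitrary illustrative choices.

```python
import math
import random

def exponential_pair(lam, rng):
    """Return (X, X') built from the same uniform U by inverse transform."""
    u = rng.random()
    while u == 0.0:                      # avoid log(0) in the antithetic variate
        u = rng.random()
    x = -math.log(1.0 - u) / lam         # X  = -(1/lambda) ln(1 - U)
    x_anti = -math.log(u) / lam          # X' = -(1/lambda) ln(U)
    return x, x_anti

rng = random.Random(0)
pairs = [exponential_pair(2.0, rng) for _ in range(100000)]
mean_x = sum(x for x, _ in pairs) / len(pairs)
mean_a = sum(a for _, a in pairs) / len(pairs)
cov = sum((x - mean_x) * (a - mean_a) for x, a in pairs) / len(pairs)
print(mean_x, mean_a, cov)
```

Both sample means are close to $1/\lambda$, and the sample covariance is negative: when X is large, X' is small. This negative dependence is what makes antithetic pairs useful for variance reduction.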

Example 5.3 
Normal and Gamma Distributions 

For both these cases there is no simple functional form for the inverse distribution $F_X^{-1}(\alpha)$, but
because of the importance of the Normal and Gamma distribution models, a great deal of effort
has been expended in deriving good approximations.

The Normal distribution is defined through its density,

$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right],$ (5.13)

so that

$F_X(x) = \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(v-\mu)^2}{2\sigma^2}\right]dv.$ (5.14)

The normal distribution function $F_X(x)$ is also often denoted $\Phi(x)$ when the parameters $\mu$ and $\sigma$
are set to 0 and 1, respectively. The distribution has no closed-form inverse, $F_X^{-1}(\alpha)$, but the inverse
is needed so often that $\Phi^{-1}(\alpha)$, like logarithms or exponentials, is a system function.
The inverse of the Gamma distribution function, which is given by

$F_X(x) = \frac{1}{\Gamma(k)}\int_{0}^{kx/u} v^{k-1}e^{-v}\,dv, \quad x \ge 0,\ k > 0,\ u > 0,$ (5.15)

is more difficult to compute because its shape changes radically with the value of k. It is, however,
available on most computers as a numerically reliable function.
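In Python, for instance, the inverse normal distribution function is available in the standard library as `statistics.NormalDist.inv_cdf`, so the inverse probability method for normal variates reduces to a few lines. This is a sketch; the seed, sample size, and parameter values are arbitrary.

```python
import random
from statistics import NormalDist

def normal_variate(mu, sigma, rng):
    """Return F^{-1}(U) for the N(mu, sigma^2) distribution."""
    u = rng.random()
    while u == 0.0:                  # inv_cdf requires 0 < u < 1
        u = rng.random()
    return NormalDist(mu, sigma).inv_cdf(u)

rng = random.Random(0)
sample = [normal_variate(10.0, 2.0, rng) for _ in range(20000)]
sample_mean = sum(sample) / len(sample)
z_975 = NormalDist().inv_cdf(0.975)  # standard normal quantile
print(sample_mean, z_975)
```

The standard library does not provide an inverse Gamma distribution function, so that case would need a numerical library or a custom approximation.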

Example 5.4
The Logistic Distribution

A commonly used symmetric distribution, which has a shape very much like that of the Normal
distribution, is the standardized logistic distribution,

$F_X(x) = \frac{e^x}{1+e^x} = \frac{1}{1+e^{-x}}, \quad -\infty < x < \infty,$ (5.16)

with probability density function

$f_X(x) = \frac{e^x}{(1+e^x)^2}, \quad -\infty < x < \infty.$ (5.17)

note: $F_X(-\infty) = e^{-\infty}/(1 + e^{-\infty}) = 0$ and $F_X(\infty) = 1$ by using the second form for $F_X(x)$.

The inverse is obtained by setting $\alpha = \frac{e^x}{1+e^x}$. Then $\alpha + \alpha e^x = e^x$, or $\alpha = e^x(1 - \alpha)$.
Therefore,

$x = F_X^{-1}(\alpha) = \ln\alpha - \ln(1 - \alpha),$

and the random variable is generated, using the inverse probability integral method, as follows:

$X = \ln U - \ln(1 - U).$
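The closed-form inverse makes the logistic generator a one-line transformation of a uniform variate. The sketch below also checks the round trip through the distribution function; the function names are illustrative.

```python
import math
import random

def logistic_from_uniform(u):
    """Return X = ln(u) - ln(1 - u) for 0 < u < 1."""
    return math.log(u) - math.log(1.0 - u)

def logistic_cdf(x):
    """Standardized logistic distribution function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def logistic_variate(rng):
    """Draw one standardized logistic variate by the inverse probability method."""
    u = rng.random()
    while u == 0.0:              # avoid log(0)
        u = rng.random()
    return logistic_from_uniform(u)

rng = random.Random(0)
sample = [logistic_variate(rng) for _ in range(100000)]
sample_mean = sum(sample) / len(sample)
round_trip = logistic_cdf(logistic_from_uniform(0.3))
print(sample_mean, round_trip)
```

Applying the distribution function to the generated value recovers the uniform input, which is exactly the relation used in the proof of the continuous-case theorem.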



5.3.1.2 The Discrete Case 

Let X have a discrete distribution $F_X(x)$; that is, $F_X(x)$ jumps at points $x_k$, k = 0, 1, 2, .... Usually we have
the case that $x_k = k$, so that X is integer valued.
Let the probability function be denoted by

$p_k = P(X = x_k), \quad k = 0, 1, \ldots.$ (5.18)

The probability distribution function is then

$F_X(x_k) = P(X \le x_k) = \sum_{j \le k} p_j, \quad k = 0, 1, \ldots,$ (5.19)

and the reliability or survivor function is

$R_X(x_k) = 1 - F_X(x_k) = P(X > x_k), \quad k = 0, 1, \ldots.$ (5.20)

The survivor function is sometimes easier to work with than the distribution function, and in fields such 
as reliability, it is habitually used. The inverse probability integral transform method of generating discrete 
random variables is based on the following theorem. 

THEOREM 

Let U be uniformly distributed in the interval (0, 1). Set $X = x_k$ whenever $F_X(x_{k-1}) < U \le F_X(x_k)$,
for k = 0, 1, 2, ..., with $F_X(x_{-1}) = 0$. Then X has probability function $p_k$.

PROOF

By definition of the procedure,

$X = x_k$ if and only if $F_X(x_{k-1}) < U \le F_X(x_k)$.

Therefore,

$P(X = x_k) = P\left(F_X(x_{k-1}) < U \le F_X(x_k)\right) = F_X(x_k) - F_X(x_{k-1}) = p_k,$ (5.21)

by the definition of the distribution function of a uniform (0, 1) random variable.

Thus the inverse probability integral transform algorithm for generating X is to find $x_k$ such that
$U \le F_X(x_k)$ and $U > F_X(x_{k-1})$, and then set $X = x_k$.

In the discrete case, there is never any problem of numerically computing the inverse distribution function,
but the search to find the values $F_X(x_k)$ and $F_X(x_{k-1})$ between which U lies can be time-consuming;
generally, sophisticated search procedures are required. In implementing this procedure, we try to minimize
the number of times one compares U to $F_X(x_k)$. If we want to generate many values of X, and $F_X(x_k)$ is
not easily computable, we may also want to store $F_X(x_k)$ for all k rather than recompute it. Then we
have to worry about minimizing the total memory needed to store the values of $F_X(x_k)$.
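For a distribution with finitely many points, one common way to organize the search is to store the cumulative probabilities once and locate U by binary search. This is a sketch of that design, not the only possible search procedure; the values and probabilities in the example are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

def make_discrete_sampler(values, probs):
    """Return a function mapping a uniform u to x_k with F(x_{k-1}) < u <= F(x_k)."""
    cdf = list(accumulate(probs))                  # stored values F_X(x_k)
    def sample(u):
        k = bisect_left(cdf, u)                    # smallest k with u <= F_X(x_k)
        return values[min(k, len(values) - 1)]     # guard against rounding in cdf[-1]
    return sample

# Hypothetical three-point distribution.
sample = make_discrete_sampler([0, 1, 2], [0.2, 0.5, 0.3])
print(sample(0.1), sample(0.65), sample(0.95))
```

Storing the table costs memory proportional to the number of points, while each lookup takes a number of comparisons that grows only logarithmically, which is the trade-off described in the paragraph above.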

Example 5.5 
The Binary Random Variable 

To generate a binary-valued random variable X that is 1 with probability p and 0 with
probability 1 - p, the algorithm is:

• If U < p, set X = 1.

• Else set X = 0.

Example 5.6 

The Discrete Uniform Random Variable 

Let X take on integer values between and including the integers a and b, where a < b, with 
equal probabilities. Since there are (b — a + 1) distinct values for X, the probability of getting any 
one of these values is, by definition, 1/ (b — a + 1). If we start with a continuous uniform (0,1) 
random number U, then the discrete inverse probability integral transform shows that 

X = integer part of [(b - a + 1) U + a].

note: The continuous random variable [(b - a + 1) U + a] is uniformly distributed in the open
interval (a, b + 1).
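Both of the last two examples are one-line transformations of a uniform variate. In the sketch below the "integer part" is taken with `math.floor`, an assumption that matters only when a is negative, where truncation toward zero would give the wrong integer.

```python
import math

def bernoulli_from_uniform(u, p):
    """Return 1 if u < p, else 0 (Example 5.5)."""
    return 1 if u < p else 0

def discrete_uniform_from_uniform(u, a, b):
    """Return the integer part of (b - a + 1) * u + a (Example 5.6)."""
    return math.floor((b - a + 1) * u + a)   # floor, so negative a is handled too

print(bernoulli_from_uniform(0.2, 0.3), discrete_uniform_from_uniform(0.5, 3, 7))
```

For u in [0, 1) the second function returns each of the integers a, a + 1, ..., b for a u-interval of length 1/(b - a + 1).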

Example 5.7 
The Geometric Distribution 

Let X take values on zero and the positive integers with a geometric distribution. Thus, 

$P(X = k) = p_k = (1 - p)p^k, \quad k = 0, 1, 2, \ldots,\ 0 < p < 1,$ (5.22)

and

$P(X \le k) = F_X(k) = 1 - p^{k+1}, \quad k = 0, 1, 2, \ldots,\ 0 < p < 1.$ (5.23)

To generate geometrically distributed random variables, then, you can proceed successively according
to the following algorithm:

• Compute $F_X(0) = 1 - p$. Generate U.

• If $U \le F_X(0)$, set X = 0 and exit.

• Otherwise compute $F_X(1) = 1 - p^2$.

• If $U \le F_X(1)$, set X = 1 and exit.

• Otherwise compute $F_X(2)$, and so on.
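The successive search for the geometric distribution can be sketched as a short loop. The comparison follows the rule $F_X(x_{k-1}) < U \le F_X(x_k)$ of the discrete inverse transform theorem; the function name and the value p = 0.5 in the example are illustrative.

```python
def geometric_from_uniform(u, p):
    """Return the smallest k with u <= F_X(k) = 1 - p**(k + 1)."""
    k = 0
    cdf = 1.0 - p                    # F_X(0)
    while u > cdf:
        k += 1
        cdf = 1.0 - p ** (k + 1)     # F_X(k), computed only when needed
    return k

print(geometric_from_uniform(0.1, 0.5),
      geometric_from_uniform(0.6, 0.5),
      geometric_from_uniform(0.9, 0.5))
```

The expected number of comparisons grows as p approaches 1, since large values of X then become likely.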
A test, defined by a critical region C of size $\alpha$, is a uniformly most powerful test if it is a
most powerful test against each simple alternative in $H_1$. The critical region C is called a
uniformly most powerful critical region of size $\alpha$.



1. Consider the test of the simple null hypothesis $H_0: \theta = \theta_0$ against the simple alternative
hypothesis $H_1: \theta = \theta_1$.

2. Let C be a critical region of size $\alpha$; that is, $\alpha = P(C; \theta_0)$. Then C is a best critical region of
size $\alpha$ if, for every other critical region D of size $\alpha = P(D; \theta_0)$, we have that

$P(C; \theta_1) \ge P(D; \theta_1).$

CUMULATIVE DISTRIBUTION FUNCTION 

1. Let X be a random variable of the discrete type with space R and p.d.f. f(x) = P(X = x),
$x \in R$. Now take x to be a real number and consider the set A of all points in R that are less
than or equal to x. That is, $A = \{t : t \le x,\ t \in R\}$.

2. Define the function F(x) by

$F(x) = P(X \le x) = \sum_{t \in A} f(t).$ (1.1)

The function F(x) is called the distribution function (sometimes cumulative distribution
function) of the discrete-type random variable X.

D DEFINITION OF EXPONENTIAL DISTRIBUTION 

Let $\lambda = 1/\theta$; then the random variable X has an exponential distribution and its p.d.f. is
defined by

$f(x) = \frac{1}{\theta}e^{-x/\theta}, \quad 0 \le x < \infty,$ (2.4)

where the parameter $\theta > 0$.
DEFINITION OF RANDOM VARIABLE 

1. Given a random experiment with a sample space S, a function X that assigns to each element s 
in S one and only one real number X (s) = x is called a random variable. The space of X is 
the set of real numbers $\{x : x = X(s),\ s \in S\}$, where $s \in S$ means the element s
belongs to the set S.

2. It may be that the set S has elements that are themselves real numbers. In such an instance we 
could write X (s) = s so that X is the identity function and the space of X is also S. This is 
illustrated in the example below. 

DEFINITION OF UNIFORM DISTRIBUTION 
The random variable X has a uniform distribution if its p.d.f. is equal to a constant on its
support. In particular, if the support is the interval [a, b], then

$f(x) = \frac{1}{b - a}, \quad a \le x \le b.$ (2.3)



G 



Given a random sample $X_1, X_2, \ldots, X_n$ from a normal distribution $N(\mu, \sigma^2)$, consider the
closeness of $\bar{X}$, the unbiased estimator of $\mu$, to the unknown $\mu$. To do this, the error structure
(distribution) of $\bar{X}$, namely that $\bar{X}$ is $N(\mu, \sigma^2/n)$, is used in order to construct what is called a
confidence interval for the unknown parameter $\mu$, when the variance $\sigma^2$ is known.



If $E[u(X_1, X_2, \ldots, X_n)] = \theta$, then the statistic $u(X_1, X_2, \ldots, X_n)$ is called an unbiased estimator of $\theta$.
Otherwise, it is said to be biased.



1. If w < 0, then F(w) = 0 and F'(w) = 0. A p.d.f. of this form is said to be one of the gamma
type, and the random variable W is said to have the gamma distribution.

2. The gamma function is defined by

$\Gamma(t) = \int_0^\infty y^{t-1}e^{-y}\,dy, \quad 0 < t.$



Let X have a gamma distribution with $\theta = 2$ and $\alpha = r/2$, where r is a positive integer. If the
p.d.f. of X is

$f(x) = \frac{1}{\Gamma(r/2)\,2^{r/2}}x^{r/2-1}e^{-x/2}, \quad 0 \le x < \infty,$ (2.6)

we say that X has a chi-square distribution with r degrees of freedom, which we abbreviate
by saying X is $\chi^2(r)$.

M MATHEMATICAL EXPECTATION 

If f(x) is the p.d.f. of the random variable X of the discrete type with space R and if the
summation

$\sum_{R} u(x)f(x) = \sum_{x \in R} u(x)f(x)$ (1.2)

exists, then the sum is called the mathematical expectation or the expected value of the
function u(X), and it is denoted by E[u(X)]. That is,

$E[u(X)] = \sum_{R} u(x)f(x).$ (1.3)

We can think of the expected value E[u(X)] as a weighted mean of u(x), $x \in R$, where the
weights are the probabilities f(x) = P(X = x).


o 

1. Once the sample is observed and the sample mean computed equal to $\bar{x}$, the interval

$\left[\bar{x} - z_{\alpha/2}\left(\sigma/\sqrt{n}\right),\ \bar{x} + z_{\alpha/2}\left(\sigma/\sqrt{n}\right)\right]$

is a known interval. Since the probability that the random interval covers $\mu$ before the sample is
drawn is equal to $1 - \alpha$, call the computed interval, $\bar{x} \pm z_{\alpha/2}(\sigma/\sqrt{n})$ (for brevity), a
$100(1 - \alpha)\%$ confidence interval for the unknown mean $\mu$.

2. The number $100(1 - \alpha)\%$, or equivalently, $1 - \alpha$, is called the confidence coefficient.

P POISSON DISTRIBUTION 

We say that the random variable X has a Poisson distribution if its p.d.f. is of the form

$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots,$

where $\lambda > 0$.

POISSON PROCESS

Let the number of changes that occur in a given continuous interval be counted. We have an
approximate Poisson process with parameter $\lambda > 0$ if the following are satisfied:

PROBABILITY DENSITY FUNCTION 

1. Function f(x) is a nonnegative function such that the total area between its graph and the x axis
equals one.

2. The probability P(a < X < b) is the area bounded by the graph of f(x), the x axis, and the
lines x = a and x = b.

3. We say that the probability density function (p.d.f.) of the random variable X of the
continuous type, with space R that is an interval or union of intervals, is an integrable function
f(x) satisfying the following conditions:

• $f(x) > 0$, $x \in R$,

• $\int_R f(x)\,dx = 1$,

• The probability of the event $X \in A$ is $P(X \in A) = \int_A f(x)\,dx$.

PROBABILITY DENSITY FUNCTION 

1. The distribution function of a random variable X of the continuous type is defined in terms of
the p.d.f. of X, and is given by

$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt.$

2. From the fundamental theorem of calculus we have, for x values for which the derivative F'(x)
exists, that F'(x) = f(x).

T t Distribution 

If Z is a random variable that is N(0, 1), if U is a random variable that is $\chi^2(r)$, and if Z and
U are independent, then

$T = \frac{Z}{\sqrt{U/r}}$ (2.9)

has a t distribution with r degrees of freedom.
1. The random variable X has a normal distribution if its p.d.f. is defined by

$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \quad -\infty < x < \infty,$ (2.8)

where $\mu$ and $\sigma^2$ are parameters satisfying $-\infty < \mu < \infty$, $0 < \sigma < \infty$, and also where exp[v]
means $e^v$.

2. Briefly, we say that X is $N(\mu, \sigma^2)$.
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H highly skewed, 48 

how large should the sample size be to 
estimate a mean?, 49 
Hypotheses Testing, § 4.5(76) 

I In general, 50 

is called an estimator of 8, 43 
it means be equal to 0, 43 

K k = 10 heads, 78 
k = 19 heads, 76 

L least squares estimation (LSE), 43 
Likelihood function, § 3.5(51) 
linear, 6 

M MATHEMATICAL EXPECTATION, 5, 
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§ 3.5(51), 53 
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N Neyman-Pearson Lemma, § 4.4(73) 
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